Abstract-Process, voltage and temperature variations are on the rise with technology scaling. Nano-scale technology requires huge design margins to ensure reliable operation. Worst case design margining consumes a significant amount of circuit and system resources. In-situ error detection or correction is an alternative method for cost-effective variation tolerance. However, existing in-situ error detection and correction circuits are power and area hungry since they use speculative error management, which yields smaller power savings at higher error rates. This paper proposes an error resilience technique that utilizes the available slack in the design. The proposed method uses a clock stretching circuit to relax timing margins on selected critical paths that have sufficient consecutive stage slack. We also propose a power optimization method which reshapes the critical path logic in proportion to the consecutive stage slack. Experimental results show that the proposed method achieves power and area savings of 40% and 8% respectively compared to the worst case design approach. Compared to the TIMBER error resilience approach, the proposed method saves more than 74% in power and more than 13% in area at design time.
I. INTRODUCTION
Process inaccuracies are getting worse with technology scaling [1], [2]. Dynamic variations are on the rise as more devices are packed into one chip. As a result, nanometer technology nodes need huge design margins. Wearables and the Internet of Things (IoT) demand energy efficient design, which confines the devices to operate in the sub-threshold or near-threshold regions. Delay variations tend to grow exponentially in these regions, which increases the required design margins [3]. Margins are inserted at the expense of chip resources. If we use traditional Worst Case Design (WCD) margins, the overheads will be huge. We need new Variation Tolerant Design (VTD) techniques [4] that improve chip reliability with lower delay, power and area overheads, as shown in Fig. 1.
In-situ error detection and correction is an alternative to WCD which uses error detection flip-flops and an error correction architecture in the design. This allows Better than Worst Case (BTWC) design [5], [6] to tune the operating point of the chip. The efficiency of these error resilience techniques depends on the number of resilient elements used, the circuit level complexity, the activation rate of critical paths and the architectural overheads for error correction and/or error propagation. These factors should be optimized so that the overheads do not exceed the savings obtained. Fig. 2 shows the typical blocks in an error resilient architecture and the overheads involved. A normal flip-flop has a master latch, a slave latch and a clock buffer, with a total transistor count of approximately 24. A typical error resilient architecture [7], [18] adds another 42 transistors to the circuit, which results in considerable area and power overhead, as shown in Table I.
In this paper, we propose deterministic slack analysis to find the available slack in a processor pipeline. A clock control circuit is then used to stretch the clock of selected critical path flip-flops which have sufficient slack in the consecutive stages. This allows the critical stages to borrow time from the consecutive stage without any circuit level redundancy overheads. The method is non-speculative, which removes the architectural overheads of error generation and propagation. The slack available in the consecutive stages is used to reshape the critical path combinational logic. This results in considerable power and area savings at design time. Error generation logic is optional unless closed loop voltage or frequency scaling is required.
The key contributions of this paper are as follows:
1) We demonstrate a methodology to improve the design margin of critical paths by selective clock stretching. This reduces power and area overheads compared to other error resilience techniques.
2) We also propose a power optimization strategy that reshapes the critical paths with sufficient consecutive slack. This results in considerable power and area savings for the resilient design.
The subsequent sections of this paper are organized as follows. Section II describes the related work at the circuit, architecture and algorithmic levels. Section III explains the motivation for the proposed approach. Section IV discusses the proposed design methodology and power optimization method. Finally, Section V presents simulation results and Section VI draws the conclusions.
II. STATE OF THE ART METHODOLOGIES
Variations are often manifested as timing errors in a typical processor pipeline. Variation tolerant design equips critical paths with suitable error detection elements. These elements use redundant logic to sample the input data at a delayed clock. Error detection circuits compare the data input with the delayed sample and generate an error signal, which is used for error correction at the circuit or architectural level. These circuits have huge power, area and latency overheads, which make them less attractive. Razor I [10], Bubble Razor [11], Razor II [12], DSTB and TDTB [13] belong to this category. Meta-stability and short path overheads are also common in error resilient circuits. Error masking circuits mask the timing errors using redundant logic. They too have huge power and area overheads at the circuit level. Moreover, the speculative time borrowing leads to huge error propagation overheads. TIMBER flip-flops [7] and Soft Edge flip-flops [8] belong to this category. Most digital designs exhibit the critical wall of slack behavior [9]. This leads to a rapid surge in the number of critical paths under dynamic operating point tuning, which makes the cost of resilience higher. EVAL (Environment for Variation Afflicted Logic) [14] uses adaptive body bias and supply voltage knobs to speed up slower paths and slow down faster paths. Blue Shift [15] tunes the timing constraints and bias voltage to optimize selected critical paths. Power aware slack distribution (SlackOptimizer) [16] uses cell sizing to distribute slack evenly in a power and cost efficient manner.
Selective End Point Optimization (SEOpt), Clock Skew Optimization (SkewOpt) and Combined Optimization (CombOpt) [17] reduce the cost of resilience by replacing error tolerant registers with conventional ones using additional margin insertion. The level of robustness achieved using these methods still lags behind the computational complexity and the overheads involved.
III. MOTIVATION
Slack analysis experiments were done on a 40nm industrial processor with a three-stage pipeline consisting of fetch, decode and execute stages. The processor core has a complexity of 26K logic gates and 7000 flip-flops. We chose the most critical endpoints, those with slack less than 2% of the clock period. Fig. 3 shows the critical path (pipe_1) slacks in the fetch, decode and execute stages and their consecutive stage (pipe_2) slacks. Results show that 85% of the critical paths have sufficient slack in the consecutive stage. There is a mean slack improvement of 42X on the critical paths if they borrow slack from the consecutive stage. This deterministic slack analysis approach avoids the speculative overheads associated with a typical error resilience technique like TIMBER [7]. The selected critical paths can use a clock stretching circuit to borrow time in proportion to the consecutive slack. The design margin improvement along the critical paths helps the pipeline to tolerate more process variations. It also allows us to reshape the combinational logic along most of the critical paths to obtain power and area savings.
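The selection step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' scripts: the `Endpoint` record, the function name and the sample slack values are hypothetical, and the threshold for "sufficient" consecutive slack (here one eighth of the clock period, the smallest stretch value used later in the paper) is an assumption. Only the 2% criticality threshold comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    slack: float       # slack of the path ending at this flip-flop (pipe_1), in ns
    next_slack: float  # worst slack of the paths leaving it (pipe_2), in ns


def find_borrow_candidates(endpoints, t_ck, crit_frac=0.02, min_borrow_frac=0.125):
    """Split endpoints into critical ones and those that can borrow time.

    crit_frac=0.02 follows the 2% criterion in the text.
    min_borrow_frac is an ASSUMED definition of "sufficient" consecutive slack.
    """
    critical = [e for e in endpoints if e.slack < crit_frac * t_ck]
    candidates = [e for e in critical if e.next_slack >= min_borrow_frac * t_ck]
    return critical, candidates


# Hypothetical slack report for a 2 ns clock period.
endpoints = [
    Endpoint("fetch/ff_a", 0.01, 0.60),
    Endpoint("fetch/ff_b", 0.03, 0.10),
    Endpoint("exec/ff_c", 0.50, 0.90),
    Endpoint("exec/ff_d", 0.02, 0.05),
]
critical, candidates = find_borrow_candidates(endpoints, t_ck=2.0)
coverage = len(candidates) / len(critical)  # fraction of critical paths that can borrow
```

In this made-up report, three endpoints are critical and one of them has enough consecutive slack to borrow from; the paper reports the corresponding fraction as 85% for its processor.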
IV. PROPOSED DESIGN METHODOLOGY: RESILIENCE BY CLOCK STRETCHING
In this section, we explain the proposed clock stretching scheme, which relaxes the design margins on the critical paths. Fig. 4 shows the clock stretching circuit along with a master slave flip-flop. The design is similar to a master slave flip-flop except for the clock control signal P. The stretched clock P allows the data input to change beyond the positive clock edge. P is derived from the original clock CK and the delayed version of the clock, DCK. During the low phase of P, transmission gate TG0 is open whereas TG1 is closed, and master latch L0 samples the input data. During the low to high transition of P, transmission gate TG1 is open and TG0 is closed, and the shadow latch L1 samples the data to the output. Compared to a normal master slave flip-flop, the master latch L0 has a transparency window during which any delayed input is also sampled to the intermediate node. This window is set by the delayed clock DCK. We fix the delay of DCK based on the slack analysis results. In our experiments we use DCK delays of Tck/8, Tck/4, 3Tck/8 and Tck/2, in proportion to the consecutive slack, where Tck is the clock period. There is no error speculation because the design margin relaxed at each critical path is fixed at design time based on the slack available in the consecutive stage. For dynamic voltage/frequency scaling, a transition detector can optionally be used for error generation. There is no metastability issue as the input is not sampled very close to the clock transition. There will be hold time issues due to the delayed sampling of input data. These can be rectified by adding delay buffers in the corresponding short paths. This approach has less overhead than the TIMBER circuit shown in Fig. 5. TIMBER overheads include redundancy, error generation logic and clock control signal generation. In addition, the speculative nature of its error resilience necessitates error propagation logic in TIMBER.
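The hold time fix mentioned above can be estimated with a small calculation. The sketch below is a simplification under stated assumptions, not the authors' flow: it assumes the capture window stays open for the full DCK delay after the clock edge, so that the shortest path into a stretched flip-flop must be at least the hold time plus the DCK delay. The function name, the fixed per-buffer delay and all numeric values are hypothetical. Times are integers in picoseconds to keep the arithmetic exact.

```python
def hold_buffers_needed(min_path_delay_ps, t_hold_ps, dck_delay_ps, buffer_delay_ps):
    """Estimate how many delay buffers a short path needs after clock stretching.

    ASSUMPTION: the stretched flip-flop can still capture data for dck_delay_ps
    after the clock edge, so the shortest path must satisfy
        min_path_delay >= t_hold + dck_delay.
    """
    required_ps = t_hold_ps + dck_delay_ps
    shortfall_ps = required_ps - min_path_delay_ps
    if shortfall_ps <= 0:
        return 0  # the short path is already slow enough
    # Integer ceiling division: round the buffer count up.
    return (shortfall_ps + buffer_delay_ps - 1) // buffer_delay_ps


# Hypothetical example: Tck = 2000 ps, DCK delay = Tck/4 = 500 ps.
buffers = hold_buffers_needed(
    min_path_delay_ps=300, t_hold_ps=50, dck_delay_ps=500, buffer_delay_ps=50
)
```

A real flow would let the timing tool pick buffer cells per path; the point of the sketch is only that the padding grows with the chosen DCK delay, which is one cost to weigh when selecting larger stretch values.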
A. Power optimization by reshaping
We use custom scripts for slack analysis that are plugged into the existing CAD tools. The critical paths under consideration are evaluated for consecutive stage slack. Fig. 6(a) shows the case where there is no consecutive slack. Here we do not modify the pipeline stage. Fig. 6(b) shows the case where there is sufficient slack in the second pipeline stage. In this case we use the stretched clock for the critical endpoint register. This enables the critical stage to borrow a time TB proportionate to slack2, so the design margin of pipeline stage 1 is relaxed by TB. We then reshape the critical path combinational logic in proportion to the time borrowed. This downsizes the critical path logic and results in power and area savings. Fig. 7 shows the pseudo code for the slack analysis and critical path reshaping. P represents the critical paths under consideration and S represents the corresponding consecutive slacks. For each critical path in P, the clock stretching value is fixed to one of four values, from TB1 = Tck/8 to TB4 = Tck/2. The critical path is also reshaped by the same amount. For reshaping, we relax the timing margin on the selected paths and resynthesize the logic to obtain power and area savings. The critical endpoints tend to have the same range of consecutive slack. This makes it possible to cluster the endpoints with the same pipe_2 slack and use a common clock control for them.
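Since Fig. 7 is not reproduced here, the following Python sketch illustrates one way the assignment and clustering described above could look. It is not the authors' pseudo code: the rule of choosing the largest stretch value that fits within the consecutive slack is an assumption, and the function names and slack values are hypothetical. Only the four stretch values (Tck/8, Tck/4, 3Tck/8, Tck/2) and the idea of grouping endpoints under a common clock control come from the text.

```python
from collections import defaultdict
from fractions import Fraction

# Stretch values TB1..TB4 as fractions of the clock period (from the text).
STEPS = [Fraction(1, 8), Fraction(1, 4), Fraction(3, 8), Fraction(1, 2)]


def pick_stretch(next_slack_ps, t_ck_ps):
    """Return the largest stretch fraction that fits in the consecutive slack.

    ASSUMPTION: 'proportionate to the consecutive slack' is modelled as
    rounding the slack down to the nearest available step. Returns None
    when even Tck/8 does not fit (the Fig. 6(a) case, left unmodified).
    """
    best = None
    for step in STEPS:
        if step * t_ck_ps <= next_slack_ps:
            best = step
    return best


def plan_reshaping(consecutive_slacks_ps, t_ck_ps):
    """Group critical endpoints by stretch value so each group can share one clock control."""
    clusters = defaultdict(list)
    for name, next_slack_ps in consecutive_slacks_ps.items():
        step = pick_stretch(next_slack_ps, t_ck_ps)
        if step is not None:
            clusters[step].append(name)
    return dict(clusters)


# Hypothetical consecutive-stage slacks (ps) for a 2000 ps clock period.
S = {"ff_a": 600, "ff_b": 100, "ff_c": 1200, "ff_d": 550}
clusters = plan_reshaping(S, t_ck_ps=2000)
```

Each resulting cluster would then be resynthesized with its timing constraint relaxed by the corresponding TB value, which is the step that yields the power and area savings.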
V. RESULTS AND ANALYSIS

A. Flip-flop level savings
We use a simplified flip-flop structure to balance the slack between different pipeline stages. Unlike the reference TIMBER flip-flop, the proposed circuit has a footprint similar to that of a standard master slave flip-flop, which makes it easy for the CAD tools to use the flip-flops in the design flow. The proposed scheme attains an area reduction of 60% per flip-flop compared to TIMBER. Table II shows the flip-flop level savings in C to Q delay, minimum power and maximum power compared to TIMBER. We get a rising C to Q delay reduction of 18.21%, a falling C to Q delay reduction of 13.83%, a maximum power saving of 23% and a minimum power saving of 25.6% compared to the reference TIMBER flip-flop.
B. Chip level savings
The proposed methodology is used to design the critical pipeline stages of an industrial processor. Power and area are compared against a baseline design with normal flip-flops and against a TIMBER based error resilient architecture. Table III shows the power comparison between the baseline, TIMBER and the proposed scheme in the fetch, decode and execute pipeline stages. The fetch stage has 39 critical paths with sufficient consecutive slack, the decode stage has no paths to be reinforced and the execute stage has 358 paths with sufficient slack. Table IV shows the area comparison between the baseline, TIMBER and the proposed scheme. Slack analysis groups the critical flip-flops based on the slack available in the next stage. Fig. 8(a) and (b) show the power and area comparisons between the baseline, TIMBER and the proposed scheme. Power comparison results show a 36% power overhead for TIMBER and a 34% power saving for the proposed scheme in the fetch stage against the baseline design.
In the execute pipeline stage, TIMBER has a power overhead equal to 2X the baseline whereas the proposed scheme gives a power saving of 46%. Area comparisons in the fetch stage show an area overhead of 2% for TIMBER and a saving of 13% for the proposed scheme against the baseline. In the execute stage, TIMBER has an area overhead of 7% whereas the proposed method gives an area saving of 6% compared to the baseline. The critical paths with less slack in the consecutive stage were found to be less active paths. Thus reinforcing 85% of the active critical paths using the proposed scheme will improve the overall robustness of the chip. The mean slack improvement of 42X compared to the baseline design will equip the chip to tolerate more delay variations within the improved slack window. We use custom scripts to analyze the critical path slacks and the activity factors along them. The gain in power and area, coupled with improved delay variation tolerance, far outweighs the computational complexity involved.
VI. CONCLUSIONS
The proposed methodology is able to retain the throughput advantages of error masking circuits and at the same time reduce the power/area overheads compared to TIMBER. The design margin reclaimed is used to reshape the combinational circuits for power/area savings. Experimental results show power and area savings of 34% and 13% respectively in the fetch pipeline stage, and 46% and 6% respectively in the execute pipeline stage. At the circuit level, we get a rising C to Q delay reduction of 18.21%, a falling C to Q delay reduction of 13.83%, a maximum power saving of 23% and a minimum power saving of 25.6% compared to the reference TIMBER flip-flop. The deterministic nature of the proposed method and its lower cost overheads compared to other speculative error resilience approaches make it a viable option for error resilience when facing high error rates.
