Abstract We propose a low-overhead, one-cycle timing-error detection and correction (EDAC) technique for flip-flop based pipelines. To prevent data collision during local clock gating for rapid error correction, the proposed technique gates the clocks of the master and the slave latches inside the flip-flops independently. Unlike previous flip-flop based one-cycle EDAC techniques, the independent clock gating in the proposed technique enables selective replacement of EDAC flip-flops, thereby reducing the area and power consumption overhead. Our experiments using a 3-stage pipeline consisting of 8-bit multipliers showed that the proposed technique improved the area and power consumption by 66% and 88%, respectively, compared to the state-of-the-art flip-flop based EDAC technique, while showing area and power consumption comparable to the two-phase latch based EDAC technique. A 32-bit, 5-stage MIPS microprocessor data path testchip based on the proposed technique was implemented in a 65 nm CMOS technology. With the proposed one-cycle EDAC technique, the silicon measurement results from 31 dies showed 24.3% higher throughput and 8.7% less energy consumption beyond the point of the first failure (PoFF).
Introduction
As CMOS technology continues to scale down, a significant timing margin must be assigned to guarantee the correct operation of application-specific integrated circuits (ASICs) and system-on-chips (SoCs). This excessive timing margin is mainly needed to account for the increased sensitivity of the propagation delay to process, voltage, and temperature (PVT) variations, as well as aging effects such as bias temperature instability (BTI) [1].
Two major design approaches have been proposed to address the issues with such large timing margins [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26].
The first design approach involves predicting the timing error with on-chip sensors or canary circuits and adjusting the operating voltage and/or frequency dynamically so that timing errors never actually occur [2, 3, 4, 5, 6, 7, 8, 9, 10]. However, it is difficult to predict timing errors induced by rapidly changing and localized variations because of the slow response time and limited number of on-chip sensors.
The second design approach involves detecting the actual occurrence of a timing error and correcting it during runtime [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. One of the most popular on-chip timing-error detection techniques involves double-sampling the data; the first sampling uses the main flip-flop/latch and the second sampling uses a shadow latch/flip-flop driven by a delayed clock [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26].
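The double-sampling idea can be illustrated with a small behavioural model. This is only a sketch under simplifying assumptions: the class name `DoubleSampler`, its fields, and the nanosecond values are hypothetical and do not come from any of the cited designs. The main element samples at the clock edge, the shadow element samples after a delay (the speculation window), and a mismatch flags a timing error.

```python
from dataclasses import dataclass


@dataclass
class DoubleSampler:
    """Hypothetical behavioural model of double-sampling error detection."""
    period_ns: float  # clock period
    window_ns: float  # delay of the shadow clock (speculation window)

    def sample(self, arrival_ns: float, new_value: int, old_value: int):
        """Return (main, shadow, error) for data that settles at arrival_ns."""
        # The main element captures the new value only if it settles by the clock edge.
        main = new_value if arrival_ns <= self.period_ns else old_value
        # The shadow element samples later, at the delayed clock edge.
        shadow = new_value if arrival_ns <= self.period_ns + self.window_ns else old_value
        # A mismatch between the two samples signals a timing error.
        return main, shadow, main != shadow


sampler = DoubleSampler(period_ns=10.0, window_ns=2.0)
print(sampler.sample(9.0, new_value=1, old_value=0))   # on time: no error
print(sampler.sample(11.0, new_value=1, old_value=0))  # late, inside the window: error flagged
print(sampler.sample(13.0, new_value=1, old_value=0))  # too late: both samples are stale
```

The last call shows the limit of the approach in this model: data arriving after the delayed clock edge is missed by both samples, so the speculation window bounds how late a detectable error can be.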
Once a timing error is detected, fast correction is desirable in order to increase the throughput while guaranteeing the correct operation. Early EDAC techniques adopted global clock gating [11, 19], counterflow pipelining [11], and instruction replay [12, 14] to recover from a timing error in a pipeline. However, these correction techniques require several cycles to recover from the timing error. Consequently, the throughput of the pipeline is degraded as the number of errors increases. In addition, multi-cycle error correction requires invasive architectural changes, so the architecture must be designed with the error correction scheme in mind from the beginning of the design phase [20].
To enable architecture-independent one-cycle timing error correction, M. Fojtik et al. proposed a local clock gating scheme based on two-phase clocking with transparent latches [20]. However, its usage is limited because most digital systems are based on edge-triggered flip-flops; additional design work such as pipeline retiming is thus required to convert the flip-flop based pipeline into a two-phase latch based pipeline [20, 21, 22, 23]. In addition, retiming a pipeline incurs a considerable penalty in both area and power consumption.
I. Shin et al. demonstrated a two-step clock-gating scheme dedicated to flip-flop based pipelines to overcome the issues in the two-phase latch based technique [24, 25] . However, in order to prevent data collision, the shadow latch inside the EDAC flip-flop must receive incoming data during local clock gating. Thus, all flip-flops in the pipeline stages should be replaced with the EDAC flip-flops to guarantee integrity. This limitation results in significant overhead in area and power consumption not only from the increased number of shadow latches but also from the increased burden in the clock network used to drive the shadow latches.
To reduce the area and power consumption overhead in state-of-the-art one-cycle EDAC techniques, we propose a new one-cycle EDAC technique for the flip-flop based pipeline. The local clock-gating scheme in the proposed EDAC technique independently stalls the master and the slave latches inside a flip-flop to recover the timing errors within a cycle with no data collision. Unlike the two-phase latch based technique, retiming is not required because the proposed technique is dedicated to flip-flop based pipelines. Moreover, unlike the previous flip-flop based technique, no shadow latch is required to prevent data collision because the master latch inside the EDAC flip-flop, instead of the shadow latch, receives the incoming data from the previous stage while the slave latch is stalled. Therefore, the proposed EDAC technique allows selective replacement of EDAC flip-flops in the pipeline, which leads to substantial savings both in area and power consumption compared to the previous flip-flop based EDAC technique [24, 25] .
The remainder of this paper is organized as follows. The proposed one-cycle EDAC technique is introduced in section 2. A comparative analysis of the area and power consumption overhead of the previous works and the proposed technique is presented in section 3. Testchip implementation and silicon measurement results are presented in section 4. Finally, we conclude the paper in section 5.
Proposed one-cycle error detection and correction technique
2.1 Proposed local clock-gating scheme
The main idea of the proposed error-correction scheme is to avoid data collisions by gating clock signals for the master latch (m_clk) and the slave latch (s_clk) inside flip-flops independently [26]. Note that the m_clk and the s_clk are typically complementary in the conventional flip-flop based pipelines. Fig. 1 shows a comparison of the data flows in the local clock-gating condition between the conventional flip-flop based pipeline and the proposed pipeline. We assume that s_clk receives clk except in the local clock-gating condition.
In the conventional scheme, the master latches in both flip-flops (FF1 and FF2) are enabled and receive data d3 and d2 from the previous stages at t0. A half-cycle later, at t1, local clock gating is applied to FF2 while FF1 is not stalled. Another half-cycle later, at t2, the data d3 from FF1 flows into FF2 and collides with d2 because the master latch in FF2 is still transparent. In fact, the collision could have occurred at t1 if data d3 had a short data path. In contrast, the proposed scheme applies the local stalling to the master or the slave latches selectively in each half clock cycle. At t2, data d3 does not collide with data d2 because the clock of the master latch in FF2 is gated. At t3, the clock-gating bubble is moved to the slave latch in FF1. The bubble propagation prevents data collision and enables local clock gating for one-cycle error correction.
Fig. 2 shows an example of the clock-gating bubble propagation flow in the proposed one-cycle EDAC technique when a timing error occurs. Note that the error is always detected at the slave latch of the EDAC flip-flop. Once an error is detected in the slave latch of a flip-flop (stage C), the master latch in the next flip-flop (stage D) is stalled in the next half-cycle to block false data propagation. Meanwhile, the error is corrected in the failed stage by restoring the main flip-flop with the late arriving data D2. Another half-cycle later, the bubble starts propagating in both forward and backward directions to enable local clock gating. Each latch inside a flip-flop is stalled once per error; one-cycle error correction is thus realized. As shown in Fig. 2, the proposed one-cycle EDAC technique operates correctly with only selected flip-flops replaced, unlike the previous technique [24, 25], which required all the flip-flops to be replaced with EDAC flip-flops, thereby saving area and power consumption.
We note that the core principle of local clock gating and the bubble-propagation mechanism of the proposed technique are similar to those used in the two-phase latch based technique, which detects the error at every other stage [20]. On the other hand, the two techniques differ in the data-restoring mechanism in an EDAC sequential element. In the two-phase latch based technique, the main latch continues to receive the late arriving data when an error occurs; restoring from the shadow latch is thus not necessary. However, in the proposed technique, the shadow flip-flop receives the late arriving data, and hence the correct data needs to be sent to the main flip-flop from the shadow flip-flop for error recovery, which is similar to the scheme used in [11] and [12].
Meanwhile, the proposed technique should not allow indefinite propagation of the bubbles along the loop [20, 24, 25] . In the proposed bubble propagation flow, the bubble propagation is stopped when the bubbles coming from both directions meet each other as shown in Fig. 3a . Consequently, the indefinite propagation of the bubble is prevented. Note that, because the error is only detected at the slave latch, in all cases, the backward and forward bubbles do not cross each other without meeting at a stage. This renders the clock-gating control logic of the proposed technique simpler than the state-of-the-art flip-flop based technique [24, 25] . Multiple errors in a cycle are handled in a similar fashion as shown in Fig. 3b. 
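The bubble-propagation flow and its termination in a loop can be sketched with a simplified ring model. The function `propagate_bubble`, the latch indexing, and the ring topology are illustrative assumptions rather than the circuit of Fig. 2 or Fig. 3: latch 2k is taken as the master latch of flip-flop k and latch 2k+1 as its slave latch, the bubble starts at the master latch of the stage after the failed one, and it moves one latch per half-cycle in each direction. A latch ignores a request if it was already stalled in its previous cycle, that is, two half-cycles earlier.

```python
def propagate_bubble(num_flops: int, failed_stage: int):
    """Simulate clock-gating bubbles on a ring of flip-flops (hypothetical model).

    Returns (last_stall, stall_count): the half-cycle at which each latch was
    stalled and how many times it was stalled.
    """
    n = 2 * num_flops                      # two latches per flip-flop
    start = (2 * (failed_stage + 1)) % n   # master latch of the next stage
    last_stall = {start: 1}                # stalled one half-cycle after detection
    stall_count = {start: 1}
    frontier = {start}
    t = 1
    while frontier:
        t += 1
        # Every latch stalled in the last half-cycle asks both neighbours to stall.
        requests = {(i + d) % n for i in frontier for d in (-1, 1)}
        # History check: skip latches that were stalled in their previous cycle.
        frontier = {j for j in requests if last_stall.get(j) != t - 2}
        for j in frontier:
            last_stall[j] = t
            stall_count[j] = stall_count.get(j, 0) + 1
    return last_stall, stall_count


last_stall, stall_count = propagate_bubble(num_flops=3, failed_stage=0)
print(sorted(last_stall.items()))
print(all(count == 1 for count in stall_count.values()))
```

In this model the forward and backward bubbles always land on the same latch because the ring contains an even number of latches, so the propagation stops after every latch has been stalled exactly once.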
Proposed error detection and correction flip-flop and clock-gating control logic
Fig. 4 shows schematic views of the EDAC flip-flop and the clock-gating control logic for the proposed one-cycle EDAC technique. The flip-flop structure shown in Fig. 4a is similar to that proposed in [11] except that m_clk and s_clk are controlled independently and the data restoring path is connected to the slave latch instead of to the master latch. In the clock-gating control logic shown in Fig. 4b, FCG_in (BCG_in) indicates the bubble input from the previous (next) stage. When FCG_in (BCG_in) is 1, the control logic generates the clock-gating signals for m_clk and m_clkd (s_clk and s_clkd). The control logic also sends out the clock-gating signals to the control logic for the slave (master) latches with which it communicates in both forward and backward directions. The FF_history in the control logic stores the clock-gating information for the previous cycle and prevents the clock gating when FCG_in (BCG_in) is 1 if the latch was stalled in the previous cycle. The shadow flip-flop inside the proposed EDAC flip-flop detects the timing error and sends the error signal to the slave part of the clock-gating control logic that is driving the failed stage. Then, clock-gating bubble propagation is performed to restore the pipeline from the timing error, as shown in Fig. 2 and Fig. 3.
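The role of the history bit can be sketched as a one-bit state machine. This is a behavioural assumption, not the gate-level circuit of Fig. 4b: the class `GatingCell` and its `step` method are hypothetical names, one cell stands for the control logic of a single latch, and the master and slave parts are not distinguished.

```python
from dataclasses import dataclass


@dataclass
class GatingCell:
    """Hypothetical model of the clock-gating control for one latch."""
    history: bool = False  # True if the latch was stalled in the previous cycle

    def step(self, fcg_in: bool, bcg_in: bool, error: bool = False) -> bool:
        """Evaluate one cycle and return True if the latch clock is gated."""
        # A stall is requested by a bubble from either direction or by a detected error.
        request = fcg_in or bcg_in or error
        # The history bit suppresses a second stall right after the first one.
        gate = request and not self.history
        self.history = gate
        return gate


cell = GatingCell()
print(cell.step(fcg_in=True, bcg_in=False))  # True: forward bubble stalls the latch
print(cell.step(fcg_in=False, bcg_in=True))  # False: returning bubble is suppressed
print(cell.step(fcg_in=True, bcg_in=False))  # True: a later, separate bubble stalls it again
```

Suppressing the request in the cycle after a stall is what keeps a bubble from bouncing between neighbouring latches in this model.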
Analysis of area and power consumption overhead
For the comparative analysis of the area and power consumption overhead in state-of-the-art one-cycle EDAC techniques and the proposed EDAC technique, we designed a 3-stage pipeline with 8-bit multipliers in a 65 nm CMOS technology as a test circuit. Commercial EDA tools for logic synthesis [27], place and route [28], and gate-level power analysis [29] were utilized for the designs. The maximum operating frequency for the test circuit was set at 333 MHz.
For the two-phase latch based technique [20], all the flip-flops in the baseline design were replaced with the combination of the master and the slave latches. In the pipeline retiming process, no time borrowing was allowed for either the master or the slave latches, and only the master latches were set to be movable, as in [20].
For the state-of-the-art flip-flop based technique [24, 25] , all the flip-flops in the baseline design were replaced with the EDAC flip-flops. All the hold time constraints for the shadow latches were met by inserting hold fix buffers. The speculation window for all the EDAC flip-flops was set at 20% of the clock cycle.
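As a quick check of scale, the speculation window can be converted to absolute time using the 333 MHz operating frequency of the test circuit. The helper name `speculation_window_ns` is hypothetical. In double-sampling schemes of this kind, data on short paths generally must not change before the delayed sampling edge, which is why a wider window tends to require more hold-fix buffers.

```python
def speculation_window_ns(freq_mhz: float, fraction: float) -> float:
    """Return the speculation window in nanoseconds (hypothetical helper)."""
    period_ns = 1e3 / freq_mhz   # clock period in ns for a frequency given in MHz
    return fraction * period_ns


window = speculation_window_ns(freq_mhz=333.0, fraction=0.20)
print(f"{window:.2f} ns")  # 0.60 ns
```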
Finally, for the proposed technique, we replaced 12.5% of all flip-flops with the proposed EDAC flip-flops.
Fig. 5 shows the area and the power consumption for each EDAC technique. For the two-phase latch based technique, the area and power consumption increased by 25% and 26%, respectively, compared to the baseline design. The major sources of overhead in the two-phase latch based technique are the increased number of master latches and combinational cells required to meet the new timing constraint during retiming and the increased size of the clock network used to drive all latches. Fig. 6 shows the result of the register retiming for each stage in the test pipeline design. The total number of latches in each stage was 52; the number of master latches added at each stage was 20. In the analysis of the two-phase latch based technique, the area and power consumption overhead of the shadow latch inside the EDAC latch was not considered because the number of EDAC latches can vary depending on the variant of the two-phase latch based technique. Note that the overhead from retiming alone was comparable to that of the proposed technique. Including the area and power consumption of the shadow latches would make the comparison more favorable to the proposed technique.
For the state-of-the-art flip-flop based technique, the major overhead in the area and power consumption is, as expected, the increased number of buffers needed to fix hold time violations. The area and power consumption for combinational cells was increased by 91% and 103% compared to the baseline design, mostly because of the hold fix buffers. The overall overhead in the area and power consumption was 99% and 118% respectively.
By enabling selective replacement of the EDAC flip-flops, the proposed EDAC technique improved the area and power consumption by 66% and 88%, respectively, compared to the state-of-the-art flip-flop based technique. It also showed area and power consumption overhead similar to that of the two-phase latch based technique.
Our experimental results showed that substantial savings in area and power consumption can be achieved with the proposed one-cycle EDAC technique while maintaining the edge-triggered clocking scheme.
Testchip implementation and silicon measurement
4.1 Testchip implementations
We implemented two testchips in a 65 nm CMOS technology to validate the operation and performance of the proposed one-cycle EDAC technique. The first testchip (testchip1) was implemented to check the functional correctness of the proposed one-cycle EDAC technique. Fig. 7a shows a schematic diagram of a 5-stage pipelined data path based on programmable delay cells (PDCs), which was implemented in testchip1. The input data of the pipeline is generated by a 4-bit pseudo random number generator (PRNG). Intentional timing errors can be generated at each stage by setting the corresponding error input signal. Fig. 7b shows the schematic view and the operation of the PDC placed between the pipeline flip-flops. Four multi-phase clocks (clk, clkd, early_clk, and late_clk) are generated by a conventional phase-locked loop (PLL). As shown in the timing diagram, the speculation window is set to π/2 of the clock cycle; early_clk leads clk by π/4 and late_clk lags clk by π/4. Thus, the early_data leads clk and no timing error occurs when ERR_IN is set to 0. On the other hand, the late_data lags clk and a timing error occurs when ERR_IN is 1. However, even when ERR_IN is set to 1, the output of the PDC leads clkd, and the timing error is detected and corrected by the shadow flip-flop inside the EDAC flip-flop.
We also implemented another testchip (testchip2) for a 5-stage data path pipeline of a 32-bit MIPS microprocessor employing the proposed one-cycle EDAC technique to measure throughput and energy consumption.
All components in testchip1, including the EDAC flip-flop, clock-gating control logic, PRNG, and PLL, were implemented and verified using a custom design methodology. Fig. 8a shows the layout view of testchip1.
Testchip2 was implemented using a cell-based design methodology based on high-level synthesis. Fig. 8b shows the layout view of testchip2.
Silicon measurement results
We measured the output of testchip1 in several timing-error conditions for a single stage and multiple stages. All measurement results exactly matched the expected data from the post-layout simulation [30] results. Fig. 9 shows one of the comparison results between the measurement data and simulation data. Input signal error[3] enables the PDC in Stage D, shown in Fig. 7a, and the intentional timing error is detected at the EDAC flip-flop in Stage D. A half clock cycle later, the master latches in the EDAC flip-flops in Stage E are stalled to prevent error propagation. Another half cycle later, clock-gating bubbles start to propagate in both directions. Thus, whenever timing errors occur in Stage D, the next stage (i.e. Stage E) is stalled for a clock cycle and the output signal data_out[3:0] is consequently maintained.
For 31 dies of testchip2, we measured the throughput and energy consumption at multiple operating voltage and frequency conditions. Fig. 10a shows the average throughput and timing error rate measured at 0.7 V. Below the point of the first failure (PoFF), the throughput gradually increases as the operating frequency is increased. However, even beyond the PoFF of 36.3 MHz, the throughput still increases owing to the proposed one-cycle correction; the maximum throughput frequency is 48.6 MHz. The measured throughputs at the PoFF and at the optimal operating frequency are 37 and 46 instructions/µs, respectively, and the gain in the throughput is 24.3%. Fig. 10b shows the average energy consumption and error rate measured at 40 MHz. Similar to the throughput results, the energy consumption continues to decrease even beyond the PoFF; the optimal supply voltage is 0.66 V, while the PoFF is 0.725 V. The gain in the energy consumption from the proposed one-cycle EDAC technique is 8.7%.
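The reported throughput gain follows directly from the two measured throughputs, and the same arithmetic applied to the two frequencies gives the corresponding frequency increase. The helper name `relative_gain_percent` is hypothetical; the input values are the ones quoted above.

```python
def relative_gain_percent(baseline: float, improved: float) -> float:
    """Return the relative increase of improved over baseline, in percent."""
    return (improved - baseline) / baseline * 100.0


# Throughput at the PoFF and at the optimal operating frequency (instructions/us).
print(round(relative_gain_percent(37.0, 46.0), 1))   # 24.3
# PoFF frequency and maximum-throughput frequency (MHz).
print(round(relative_gain_percent(36.3, 48.6), 1))   # 33.9
```

The throughput grows by less than the frequency, which is consistent with some cycles being spent on error correction beyond the PoFF; this reading is an interpretation of the reported numbers rather than a separately measured quantity.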
Conclusion
We presented a low-overhead, one-cycle timing error detection and correction (EDAC) technique for flip-flop based pipelines. One of the key advantages over the existing two-phase latch based EDAC scheme is that pipeline retiming is not required because it is a flip-flop based scheme. The main advantage over the previous flip-flop based one-cycle EDAC technique is that substantial savings in area and power consumption are possible because the proposed technique allows selective replacement of the flip-flops. Our experimental results showed that improvements of 66% and 88% in the area and power consumption, respectively, can be achieved compared to the state-of-the-art flip-flop based technique. Silicon measurement results from 31 dies of a 32-bit, 5-stage MIPS microprocessor showed gains in throughput and energy consumption beyond the point of the first failure (PoFF) owing to the one-cycle error correction with the proposed scheme. The average gains in throughput and energy consumption compared to the values at the PoFF were 24.3% and 8.7%, respectively.
IEICE Electronics Express, Vol. 16, No. 11, 1-6
