ABSTRACT Near-threshold computing improves the energy efficiency of digital circuits by several times. However, it also makes the delay variations caused by process, voltage, and temperature (PVT) variations several times worse. In situ timing monitoring-based adaptive techniques can mitigate the excessive timing margins caused by PVT variations, but current frequency and/or voltage tuning methods cause a large performance loss. In this paper, we propose a low-overhead timing error prediction monitor and a super-fast clock stretching circuit to solve this problem. Both are optimized for a near-threshold voltage of 0.5 V. When there is timing margin, the frequency is increased. When the timing becomes tight due to variations, the timing monitors generate a predictive alarm signal, and the system clock is stretched immediately to avoid real timing errors. Applied to a 40-nm CMOS Bitcoin Miner chip, simulation results show that the whole system operating at near-threshold voltage can raise the frequency to up to 2.1× that of the original non-monitored circuit. Our method effectively mitigates near-threshold variations and thereby increases energy efficiency.
I. INTRODUCTION
Near-threshold (NT) computing is very promising in integrated circuit design, because it improves energy efficiency by several times [1], [2]. However, it also makes the delay variations caused by process, voltage and temperature (PVT) variations several times worse. Take a chain of 50 FO4 inverters as an example: its delay variation (3σ/µ) increases from 11% at 0.8 V to 25% at 0.5 V in a 22 nm CMOS process, as stated in [2]. To prevent potential timing failures due to severe PVT variations at near-threshold voltage, sufficient timing/voltage margins have to be added according to the worst-case timing analysis of the traditional digital IC design flow. However, the worst case may not occur, and the real-time PVT condition is unpredictable, so these margins waste a great deal of performance and power.
To mitigate this over-pessimistic timing margin, many adaptive techniques based on in-situ timing monitoring have been proposed [3]-[11], [17]-[21], such as RazorII [3], Razor-lite [4], DSTB [5], iRazor [6], HEPP [8] and so on. They monitor the circuit timing at runtime, and use adaptive voltage scaling (AVS) to decrease power consumption, or adaptive frequency scaling (AFS) to raise the working frequency as much as possible. There are two kinds of in-situ timing monitoring techniques: error detection and correction (EDAC) [3]-[7] and timing error prediction [8], [9]. EDAC systems can eliminate the timing margin completely, but their error recovery mechanism and short-path padding problem usually lead to a complex design flow and relatively large area overhead. In-situ timing monitors need to be inserted in many critical paths, so they must be very compact in order to limit the area overhead. In the early designs, RazorII [3], Canary Flip-Flop [8] and double-sampling with time-borrowing (DSTB) [5] needed at least 47-70 transistors to detect the timing error, which is a heavy burden. Later, Razor-lite [4] was proposed with only 8 additional transistors, and iRazor [6] with only 3 extra transistors besides a latch. However, Razor-lite and iRazor cannot work properly at near-threshold voltage, because they monitor the virtual rails VVDD and VVSS, which differ somewhat in voltage from the real VDD and VSS; this makes it hard to lower the supply voltage very far. The sparse error-detecting latch (EDL) was designed for near-threshold operation, but uses 43 transistors [7]. Since the best energy efficiency point of digital circuits lies in the near-threshold voltage region [1], a timing monitor now needs to work reliably at near-threshold voltage (NTV), which is a challenge.
Another big issue is that, once these in-situ timing monitoring systems see a timing error/warning, they usually need to decrease the frequency immediately to avoid further errors. However, adaptive frequency scaling (AFS) based on PLL configuration usually needs a very long lock time. Thus, clock half-division and clock gating, which cut the frequency to half immediately, became the straightforward methods [3]-[6], [8], [9]. But they cause significant throughput loss during the error correction period. For example, the instruction replay in [4] incurred an 11-cycle performance penalty at the slow half frequency, corresponding to 22 cycles at the original frequency. Recently, adaptive clocking techniques were proposed to mitigate the effect of fast voltage droop [12]-[16]. They can stretch the clock period very quickly with little throughput loss. They select among different phase clocks generated by a DLL to stretch the system clock [15], or use a DPLL [16]. These adaptive clocking techniques aim to provide faster clock adjustment to counter fast-changing supply droop, and might also offer a solution for fast frequency tuning in in-situ timing monitoring systems. However, very few of them support near-threshold operation. In addition, the response time of some adaptive clock systems is not short enough. For example, [15] needs 1-3 cycles to stretch the clock cycle, which might cause timing failures when timing errors occur in in-situ timing monitoring applications.
Therefore, in this paper, we propose a low overhead timing error prediction monitor and a super-fast clock stretching circuit for variation-tolerant near-threshold circuits.
Our main contributions are: 1) A 14-transistor, transition-detector-based timing prediction monitor for NTV operation. It has negligible power consumption compared to a typical flip-flop. Plus, because it predicts errors rather than detecting them, it needs no additional error correction mechanism and has no short-path problem; 2) A low-overhead, super-fast (as fast as within-a-cycle) clock stretching circuit for fast frequency adjustment, which greatly reduces the frequency tuning penalty. It is also optimized to work at NTV; 3) A series of delay cells is used to generate the multi-phase clocks, which reduces the area cost and design complexity.
The proposed timing monitor and clock stretching circuit are implemented on a Bitcoin Miner chip in an SMIC 40 nm CMOS process. Monte Carlo simulations of the timing monitor show that it can work reliably at 0.5 V. Plus, it adds only 8% switching energy and 4% leakage power relative to a standard flip-flop (FF). The proposed clock stretching circuit can stretch the clock cycle immediately (within-a-cycle) when the stretch signal is enabled, and can operate at a 0.5 V supply voltage. Applied to the Bitcoin Miner chip, system simulation results show that it can raise the frequency to up to 210% of that of the non-monitored system working at the signoff frequency.
II. TIMING MONITORING SYSTEM ARCHITECTURE
We propose a novel timing monitor and a clock stretching circuit, and implement them on a Bitcoin Miner chip in an SMIC 40 nm CMOS process. The Bitcoin Miner chip is a highly computation-intensive circuit which keeps calculating whenever it is powered on. It is mainly composed of a series of parallel SHA-256 blocks and a controller, and there is no SRAM inside. Here we only use two SHA-256 blocks as a prototype chip.
As shown in the system architecture of Fig. 1, a series of timing monitors are inserted at the endpoints of selected critical paths of the main circuit. All the 'Pre_error' signals are then ORed together by dynamic OR gates to form a total error signal, 'Pre_error_all'. When 'Pre_error_all' goes high, indicating a predicted timing error, it serves as the stretch enable signal of the clock stretching circuit: it is sent to the frequency control finite state machine (FSM), which generates a stretched clock immediately so that the real timing error is avoided. We also design a duty cycle corrector, both to set an appropriate detection window and to rebalance the duty cycle after clock stretching. On the other hand, when there are no timing errors for a long period, the frequency is increased gradually to improve the circuit performance. Therefore, with our variation-tolerant timing monitoring technique, the excess timing margins in near-threshold circuits can be greatly reduced to achieve high energy efficiency.
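The control loop described above can be summarized with a small behavioral sketch. It is only an illustration under stated assumptions: the class name `AfsController`, the frequency step, and the length of the error-free interval are hypothetical and are not taken from the actual FSM design.

```python
from dataclasses import dataclass


@dataclass
class AfsController:
    """Toy model of the frequency-control loop (all parameters are assumptions)."""
    freq_mhz: float = 88.0          # starting frequency
    step_mhz: float = 4.0           # hypothetical frequency increment
    quiet_cycles_needed: int = 100  # hypothetical error-free interval before speeding up
    quiet_cycles: int = 0

    def tick(self, pre_errors):
        """Advance one clock cycle; return True if the clock is stretched."""
        pre_error_all = any(pre_errors)  # OR of all monitor outputs
        if pre_error_all:
            self.quiet_cycles = 0
            return True                  # stretch the clock for this cycle
        self.quiet_cycles += 1
        if self.quiet_cycles >= self.quiet_cycles_needed:
            self.freq_mhz += self.step_mhz  # no warnings for a while: speed up
            self.quiet_cycles = 0
        return False
```

In this sketch a single asserted 'Pre_error' is enough to request a stretch, while the frequency only rises after a run of warning-free cycles, mirroring the asymmetry between the fast stretch response and the gradual frequency increase.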
III. TIMING PREDICTION MONITOR DESIGN
Here we design a novel timing prediction monitor, whose function is to predict a timing error when the timing of a critical path is very tight but not yet erroneous. It detects whether there is an input data transition during the "detection window", and is therefore also called a transition detector (TD). The detection window (Tw) is usually a short period before the rising edge of the clock.
Its circuit schematic is shown in Fig. 2. It is mainly composed of an inverter with a transmission gate inserted at its middle node, and an XOR gate. It has only 14 transistors in total: 4 CMOS transistors plus the 10-transistor XOR from the standard cell library. Its working principle is as follows. During the detection window (the negative clock phase here), M3 and M4 in the transmission gate are shut off. If an input data transition occurs in this period, a voltage difference appears between the two internal nodes A and B, and the XOR outputs a 'Pre_error' signal.
This structure is quite simple and has no threshold loss problem, so it is suitable for NTV operation. However, it carries a potential risk if the transition time of the input data is relatively long, because transistors M1 to M4 might then be turned on simultaneously, which may cause a false alarm signal. To avoid this risk, the transistor sizes are carefully designed, using large M1 and M2. Its functionality and robustness are verified by 10k Monte Carlo simulations at 0.5 V and 0 °C, with extensive process variations covering both global and local variations. As shown in Fig. 3(b), the proposed timing monitor works reliably at 0.5 V with no false alarms.
The layout of the timing monitor is shown in Fig. 3(a), with a small area of only 2.14 µm², which is 0.48× the area of an FF in the standard cell library. We also compare its power consumption and delay with those of a typical flip-flop in the standard library at 0.5 V, SS corner, 0 °C, and 80 MHz with a switching rate of 10%, as shown in Table 1. The power overhead of our timing monitor is negligible compared to a standard FF: only 0.08× the switching energy and 0.04× the leakage power.
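At the behavioral level, the monitor raises 'Pre_error' whenever a data transition falls inside the detection window. The following minimal sketch illustrates this; it abstracts away the transistor-level circuit, and the function name `pre_error` and the time units are assumptions made for illustration.

```python
def pre_error(transition_times_ns, clock_period_ns, window_ns):
    """Return True if any data transition lands in a detection window.

    The window is modelled as the last `window_ns` before each rising
    clock edge, with rising edges at integer multiples of the period.
    """
    for t in transition_times_ns:
        time_to_next_edge = clock_period_ns - (t % clock_period_ns)
        if time_to_next_edge <= window_ns:
            return True  # data arrived late enough to be considered risky
    return False
```

For example, with a 10 ns period and a 2 ns window, a transition at 7.0 ns is still safe, whereas one at 8.5 ns arrives only 1.5 ns before the edge and triggers the prediction.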
IV. CLOCK STRETCHING CIRCUIT DESIGN
Once a timing error/warning is detected by the timing monitors, the frequency usually needs to be decreased immediately to avoid further errors. Traditional half-division frequency tuning methods cause severe performance degradation. Here we propose a low-overhead, super-fast clock stretching circuit to fulfill this function. It is low in overhead and easy to design, because we use a series of delay cells to generate the multi-phase clocks instead of a DLL as in [15]. Thus, both area cost and design complexity are reduced.
Its architecture is shown in Fig. 4, consisting of (1) a clock phase generator (CPG), (2) an adaptive clock controller (ACC), (3) a PVT monitor (PVTM) and (4) a clock phase selector (CPS). First, the CPG generates a series of clocks with the same frequency but different phases. Then one particular clock phase is picked by the CPS to generate a stretched clock, under control signals from the ACC. Here the CPS is equivalent to a multiplexer. The PVTM is used to calibrate the selection of a particular phase clock. Detailed working principles are as follows.
A. PHASE CLOCK GENERATOR (CPG)
As stated, the CPG uses a delay chain to generate clocks with multiple phases, with its input connected to the system clock. Every two adjacent clock phases have the same phase difference, to maintain the stability of clock stretching. The delay chain is built from specially designed delay cells, optimized for timing, area, driving strength and power consumption. The delay of one delay cell determines the minimum clock stretching amount.
The delay chain should also be long enough to generate a one-cycle delay under the low-power, low-frequency requirements. Here, for AFS tuning applications at NTV, 60 delay cells are used in total. The delay cells in the CPG are the same as those in the PVT monitor; therefore, delay variations are not a problem even at near-threshold voltage. That is because, regardless of the delay variation, the PVT monitor can always help select the phase clock whose accumulated phase reaches 360°.
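To illustrate why the chain must cover one full cycle, the sketch below finds the first tap whose accumulated delay reaches one clock period. The function name `full_cycle_tap` and the delay values in the usage note are hypothetical; only the chain length of 60 cells comes from the text.

```python
def full_cycle_tap(clock_period_ps, cell_delay_ps, num_cells=60):
    """Index of the first delay-chain tap whose accumulated delay reaches one period."""
    for tap in range(1, num_cells + 1):
        if tap * cell_delay_ps >= clock_period_ps:
            return tap  # this tap corresponds to an accumulated phase of 360 degrees
    raise ValueError("delay chain too short to cover one clock period")
```

With an assumed 1000 ps period, a 20 ps cell delay needs 50 taps and a slower 25 ps cell delay needs only 40, which shows how the 360° tap moves with PVT conditions and why it has to be calibrated at runtime.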
B. PHASE CLOCK SELECTOR (CPS)
The function of the phase clock selector is to select one of the multi-phase clocks as the only output clock. Thus, one and only one bit of the control signals Sel[N:1] is high in each cycle, so that the output is free of races and hazards. The control signals of the phase clock selector come from the adaptive clock controller. They need to be synchronized with the corresponding clock phases to avoid glitches on the output clock; this synchronization is done by flip-flops. A clock phase is selected when its corresponding control signal is enabled.
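The one-hot selection rule can be sketched as follows. This is a purely functional illustration: the helper names `one_hot` and `select_phase` are assumptions, and the flip-flop synchronization of the select bits is not modelled.

```python
def one_hot(index, width):
    """Build a select vector with exactly one bit set."""
    if not 0 <= index < width:
        raise ValueError("index out of range")
    return [1 if i == index else 0 for i in range(width)]


def select_phase(phases, sel):
    """Return the phase whose select bit is high; reject non-one-hot vectors."""
    if len(sel) != len(phases) or sum(sel) != 1:
        raise ValueError("select vector must be one-hot")
    return phases[sel.index(1)]
```

Rejecting any vector that is not one-hot reflects the requirement that exactly one clock phase drives the output at a time.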
C. CLOCK STRETCHING PRINCIPLE AND ITS TIMING DIAGRAM
A key point in clock stretching is to provide consecutive stretched clock for a certain amount of time, thus, the selection of clock phases must be continuous. Here it is controlled by the adaptive clock controller. The timing diagram in Fig. 5 shows its working principle. Suppose stretch_amount = 1, when the 'Stretch' signal is high, 
D. PVT MONITOR
As shown in the top left of Fig. 4, a simplified PVT monitor is used to calibrate the clock phase selection. It detects which phase clock lags the original Clock 0 by exactly 360° and is therefore aligned with it again. It is composed of a ring oscillator, which oscillates under the control of a divided clock signal under different PVT conditions. The number of oscillations (denoted as N_Counter) varies with the PVT conditions. Since N_Counter is related to the system clock period, it is used as guidance to obtain the target phase clock.
E. CLOCK STRETCH CIRCUIT DESIGN AND SIMULATIONS
The clock stretching circuit is designed in a 40 nm CMOS process with a small area of 85 × 83 µm², using only 1727 cells. It is optimized for a 0.5 V near-threshold voltage, and is thus able to support the near-threshold AFS system. Its layout and design details are shown in Fig. 6(a) and (b).
The circuit has been simulated under different PVT conditions with different stretch amount configurations as well. Simulation results show that it functions well in all test modes.
Our circuit is able to provide fine stretch amounts when given appropriate controls. Its minimum stretch amount is the delay of a single delay cell. For ease of illustration, we provide simulation results for stretch amounts of 1/4 Tclk and 1/2 Tclk as examples. Fig. 7 shows the simulation results when the stretch amount is 1/4 Tclk at 0.5 V, SS corner, 125 °C. It can be seen that the system clock is stretched at the negative clock edge when the 'Stretch' signal is high, and its period is expanded to 5/4 of a cycle immediately in the next cycle. When the 'Stretch' signal is low, the clock stops stretching and returns to its normal cycle. The simulation results for a 1/2 clock cycle stretch amount are shown in Fig. 8, which also confirm the correct functionality and the super-fast (within-a-cycle) response time.
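The effect of stretching on the clock edges can be illustrated with a short sketch that works in units of Tclk. The function name `rising_edges` and the per-cycle flag representation are assumptions for illustration; the real circuit acts on the 'Stretch' signal at the negative clock edge.

```python
from fractions import Fraction


def rising_edges(stretch_flags, stretch_fraction=Fraction(1, 4)):
    """Rising-edge times in units of Tclk.

    stretch_flags[i] is True when cycle i is lengthened by `stretch_fraction`.
    """
    t = Fraction(0)
    edges = [t]
    for stretched in stretch_flags:
        period = 1 + stretch_fraction if stretched else Fraction(1)
        t += period
        edges.append(t)
    return edges
```

With flags `[False, True, True, False]` and a 1/4 Tclk stretch, the edges fall at 0, 1, 9/4, 7/2 and 9/2 Tclk, so two stretched cycles cost only 1/2 Tclk in total, whereas running the same two cycles at half frequency would cost two full Tclk.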
V. ALL-DIGITAL DUTY CYCLE CORRECTOR
In our timing prediction monitor, we use the negative clock phase as the detection window. However, Tw has to be a short period before the rising edge of the clock, in order to predict the timing just a little before the real timing violation. Plus, the duty-cycle of the stretched clock after the clock stretching circuit becomes quite different if different stretch amounts are applied. Therefore, a duty-cycle corrector becomes a necessity.
Here an all-digital duty cycle corrector is proposed, as shown in the middle part of Fig. 1 and, with its main components, in Fig. 9(a). It locks the width of the negative clock phase so that it is suitable as the detection window. The control of the clock duty cycle is based on the truth table of the SR-latch, as shown in Fig. 9(b). S and R are not allowed to be '0' at the same time. If S and R are '1' at the same time, the output of the SR-latch holds its value. If a signal transition makes S ≠ R, the output changes accordingly. The timing is illustrated in Fig. 9(c). It can be seen that the narrow low-going pulse on S is the inverted output of the Pulse Generator, and the pulse on R is delayed for some time after the pulse on S. Therefore, a continuous clock signal is generated with a fixed negative clock phase.
Since the negative clock phase is used as the detection window, its width matters: too wide a window introduces unnecessary alarms (over-pessimistic), while too narrow a window can miss timing violations (over-optimistic). Therefore, we design a configurable delay chain in this duty-cycle corrector, whose delay sets the detection window and can be reconfigured for different requirements.
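The latch behavior described above matches a conventional active-low (NAND-type) SR latch, which can be sketched as below. The output polarity, where a low pulse on S sets the output to 1, is an assumption based on that conventional latch; Fig. 9(b) remains the authoritative truth table.

```python
def sr_latch_next(q, s, r):
    """Next output of an active-low SR latch (polarity is an assumption)."""
    if s == 0 and r == 0:
        raise ValueError("S = R = 0 is not allowed")
    if s == 1 and r == 1:
        return q                  # both inputs inactive: hold the previous output
    return 1 if s == 0 else 0     # low pulse on S sets, low pulse on R resets
```

Because the pulse on R follows the pulse on S after a fixed delay, the interval between the two output changes is set by that delay rather than by the incoming duty cycle.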
VI. CIRCUIT IMPLEMENTATION
We implement the proposed timing monitor and clock stretching circuit on a Bitcoin Miner chip in an SMIC 40 nm CMOS process. It has 2310 FFs and occupies a core area of 0.21 mm².
There are two voltage domains in this chip as shown in Fig. 10 , one is the 1.1V domain for the PLL IP, and the other is the NTV domain for all other circuits, including the bitcoin miner core, monitors, clock stretching circuit, duty cycle corrector and AFS state machine.
Our timing monitor based AFS system needs a special design flow besides the traditional digital circuit design flow. That is because inserting the timing monitors at the critical paths endpoints could not be done in RTL design phase, since the critical paths could not be distinguished at RTL design time. Therefore, a specially modified design flow is used here, as shown in Fig. 11 . Here, after the traditional IC design flow of RTL design, voltage domain design, logic synthesis, place & route, and parameters extraction, we run static timing analysis (STA) to obtain the first 10% of critical paths as the monitored paths, whose endpoint FFs locations are used to insert the timing monitors. Then we go back to RTL design phase to insert timing monitors in these places. After that, we run another round of traditional design flow to get a modified layout with inserted monitors. After STA of this layout, if there are still a few very critical paths left to be monitored, we use ECO to insert the monitors. Finally, after the design rule check, the layout is finished.
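The path selection step of this flow can be sketched as a simple ranking by slack. This is a simplified illustration: it assumes one worst-case slack value per endpoint FF, and the function name `select_monitored_endpoints` and the input format are hypothetical rather than the output format of any particular STA tool.

```python
import math


def select_monitored_endpoints(endpoint_slack_ns, fraction=0.10):
    """Return the endpoint names with the smallest slack.

    endpoint_slack_ns maps an endpoint FF name to its worst slack in ns;
    `fraction` is the share of endpoints to monitor (10% in the text).
    """
    count = math.ceil(len(endpoint_slack_ns) * fraction)
    ranked = sorted(endpoint_slack_ns, key=endpoint_slack_ns.get)
    return ranked[:count]
```

The returned endpoint list would then drive the monitor insertion back at the RTL stage, with any remaining very critical endpoints handled later by ECO.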
In the end, a total of 211 FFs in the top 10% critical paths are inserted with our monitor among all 2310 FFs, corresponding to a 9.1% insertion rate. The total area overhead is only 4.3% including the frequency control modules.
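As a rough consistency check of these figures, the following sketch recomputes the insertion rate and estimates the area taken by the monitors and the clock stretching circuit from the numbers quoted in this paper. Treating the 0.21 mm² core area as the denominator is an assumption, so the result is only indicative.

```python
MONITORED_FFS = 211
TOTAL_FFS = 2310
MONITOR_AREA_UM2 = 2.14      # area of one timing monitor
STRETCH_AREA_UM2 = 85 * 83   # clock stretching circuit
CORE_AREA_UM2 = 0.21e6       # 0.21 mm^2 core area (assumed denominator)

insertion_rate = MONITORED_FFS / TOTAL_FFS
monitor_area_um2 = MONITORED_FFS * MONITOR_AREA_UM2
area_fraction = (monitor_area_um2 + STRETCH_AREA_UM2) / CORE_AREA_UM2

print(f"insertion rate: {insertion_rate:.1%}")
print(f"monitors + clock stretching: {area_fraction:.1%} of core area")
```

Under this assumption the monitors and the clock stretching circuit account for about 3.6% of the core area; the rest of the reported 4.3% would then come from the other frequency control modules, although that breakdown is not given here.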
VII. SYSTEM SIMULATIONS RESULTS
First, the function of the Bitcoin system is verified with digital design tools. However, since these cannot provide accurate simulation of monitoring-based AFS, HSIM-VCS co-simulation is further used to verify the AFS monitoring effect. Here the clock stretching circuit and the critical paths with inserted timing monitors are simulated by HSIM, which provides transistor-level simulation with high accuracy. The other parts are simulated by VCS and interface with HSIM by providing the needed signal stimulus. The Bitcoin system uses standard cells recharacterized at 0.6 V from our previously fabricated chips, so its system simulation is based on 0.6 V. It is able to work at 0.5 V, since the most critical parts, our timing monitor and clock stretching circuit, are all designed for the lower voltage of 0.5 V.
Fig. 12 shows the whole timing-prediction-based AFS tuning process in the NTV region. At first, the frequency (represented by 'Freq_show') gradually increases from 88 MHz to 124 MHz, where the first error prediction appears, as shown in Fig. 12(a). Then the 'Pre_error_all' signal activates the clock stretch procedure, which stretches the clock cycle immediately and keeps it stretched for a few cycles, as shown in Fig. 12(b). Thus, a potential timing error is avoided.
In order to evaluate the performance benefit of our technique, we first obtain the frequency baseline by simulating the system at the worst case according to [4] and [7], which is the SS corner, a 10% reduced supply voltage and −25 °C (due to the reverse temperature effect at NTV). The baseline frequency is 88 MHz at near-threshold voltage. We then obtain the first warning frequency generated by the timing monitor at different corners and temperatures. As shown in Fig. 13, in the best case (FF corner, 125 °C), the monitored system can work at 185 MHz, which is up to 2.1× the baseline frequency at NTV.
For the TT corner and SS corner, the frequencies also increase to 148 MHz and 124 MHz respectively, which correspond to 1.68× and 1.41× the baseline frequency.
Comparisons with other timing monitors are listed in the upper half of Table 2. Our proposed timing monitor can operate at an NTV of 0.5 V, with a relatively small number of transistors and little power overhead. Plus, it does not need to change the original flip-flop. Thus, it interferes much less with the original circuit than EDAC techniques, leading to a simplified design flow and a small area overhead. Our frequency tuning method uses the super-fast clock stretching circuit, which has a much smaller performance penalty than the commonly used clock division or clock gating methods. Compared to the non-frequency-penalty method using the local VDD boosting technique in [7], our method has the advantage of an all-digital design, which is easy to implement. The whole AFS system has a low monitor insertion rate, a low area overhead and a high performance gain.
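The reported gains follow directly from the ratio of each first-warning frequency to the 88 MHz baseline, as the short check below shows. The dictionary keys are just labels chosen for this illustration.

```python
BASELINE_MHZ = 88

# First-warning frequencies reported for each corner, in MHz.
first_warning_mhz = {"FF": 185, "TT": 148, "SS": 124}


def speedup(freq_mhz, baseline_mhz=BASELINE_MHZ):
    """Frequency gain over the worst-case baseline, rounded to two decimals."""
    return round(freq_mhz / baseline_mhz, 2)


gains = {corner: speedup(f) for corner, f in first_warning_mhz.items()}
```

The resulting gains are 2.1, 1.68 and 1.41 for the FF, TT and SS corners respectively, in agreement with the values quoted in the text.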
VIII. CONCLUSION
In order to deal with the severe delay variations in near-threshold integrated circuits, we propose a low-overhead timing prediction monitor with only 14 transistors. We also propose a low-overhead clock stretching circuit optimized for NTV, which can stretch the clock cycle within a cycle when a timing error is predicted. Thus, the whole system can operate at its optimum condition with high energy efficiency and super-fast frequency tuning ability. Applied to a Bitcoin Miner chip in an SMIC 40 nm CMOS process, simulation results show that this timing-monitoring-based AFS system can work in the near-threshold region and improve the circuit performance greatly.
She is currently an Associate Professor with the National ASIC Center, Southeast University, Nanjing, China. She has authored or co-authored over 40 technical papers in conferences and journals, and authorized over 15 invention patents. Her research mainly focuses on variation resilient adaptive VLSI circuits, ultra-low power SoC design, and countermeasure techniques of security circuits.
(M'14) MINYI LU received the B.S. degree in electronic engineering from Southeast University, Nanjing, China, in 2017, where he is currently pursuing the master's degree in electronic engineering. His research mainly focuses on low power digital circuit design and information security circuit design.
XINNING LIU
LIANG WAN received the B.S. degree in electronic science and technology from Tianjin University, Tianjin, China, in 2015. He is currently pursuing the master's degree in microelectronics and solid state electronics with Southeast University, Nanjing, China. His research mainly focuses on low power digital integrated circuit design and adaptive clock design.
JUN YANG (M'14) received the B.S. and Ph.D. degrees in electronic engineering from Southeast University, Nanjing, China, in 1999 and 2004, respectively. He is currently a Professor with the National ASIC System Engineering Research Center, Southeast University. His research interests include system-on-a-chip design and navigation systems. VOLUME 6, 2018 
