Abstract
Voltage scaling is an effective technique for ultra-low-power applications. However, PVT variation severely degrades the robustness of traditional synchronous pipelines when the supply voltage scales into the sub-threshold region. In this paper, we propose Snake, a register-based bundled-data asynchronous pipeline that operates robustly in the sub-threshold region. By looping the matched delay line, Snake halves the design overhead compared to other asynchronous pipelines. We also propose a practical asynchronous design methodology that is compatible with commercial EDA tools and needs only a few modifications to the synchronous design flow. Monte-Carlo SPICE simulation shows that a pipelined multiplier applying the proposed techniques operates stably at 0.2 V, achieving a minimum power of 1.3 nW at 0.2 V and a minimum energy of 1.07 pJ per cycle at 0.3 V. It reduces energy by 6.7 times relative to a synchronous baseline design, with 22% area overhead. Comparison with state-of-the-art works shows that the proposed techniques are competitive.
Introduction
As the world moves toward IoT and AI, massive amounts of data must be transferred and processed, which poses great challenges to integrated circuits in terms of performance and energy [1, 2]. In many battery-constrained applications, energy overtakes performance as the major concern in the design trade-off. The demand for energy efficiency has motivated designers to scale the supply voltage into the sub-threshold region, exploiting the quadratic relation between dynamic power and supply voltage [3, 4]. However, as the voltage scales down to sub-threshold, the circuit delay degrades severely and becomes extremely sensitive to process, voltage, and temperature (PVT) variation, which easily leads to timing errors. The traditional method of reserving timing margin to cover PVT variation is no longer feasible because of the unacceptable overhead.

To deal with this problem, resilient design techniques have been proposed [4, 5, 6]. Their basic principle is to monitor the critical-path timing and react to the timing situation in situ. Once timing is violated, the error detection logic corrects it, and the error processing logic then increases the supply voltage to relieve the workload. However, these synchronous resilient techniques share several defects. First, they are susceptible to metastability if data arrives near the clock transition rather than after it [7, 8]. Second, the error detection window requires a large amount of padding on short paths to avoid false detection, which may cost more power than it saves [6]. Third, error correction usually requires architectural modification [8]. Last, errors must be detected and corrected within one clock cycle despite degraded delay and slow LDO response [9]. Recently, researchers have shown great interest in the potential of asynchronous circuits for variation tolerance [8, 10, 11].
Unlike synchronous circuits, asynchronous circuits use local clocks to synchronize neighboring registers. The meticulous handshaking between neighbors renders the circuits inherently resilient, and thus more variation-tolerant, more energy-efficient, and faster. Nevertheless, asynchronous designs face three main challenges. First, the design overhead may outweigh the benefit. Second, asynchronous designs are much more complicated, especially their timing constraints. Third, they lack commercial EDA support.

In this paper, we present two contributions that address these challenges. First, we propose a novel bundled-data [12] asynchronous pipeline for sub-threshold operation, called Snake. It is characterized by a looped request delay line (LRDL), which almost halves the design overhead compared to other asynchronous pipelines. In addition, the asynchronous pipeline is based on registers rather than latches, which simplifies the relative timing constraint (RTC) [13]. Second, we propose a practical design methodology that is compatible with commercial EDA tools and requires only a few modifications to the synchronous flow. To demonstrate the effectiveness of the proposed techniques, we present a case study of a 16x16 pipelined multiplier targeting SMIC 55 nm CMOS technology. Post-layout SPICE simulation with Monte-Carlo iterations shows that the multiplier operates successfully at 0.2 V with a minimum power of 1.3 nW, and achieves a minimum energy of 1.07 pJ per cycle at 0.3 V.

The rest of this paper is organized as follows. Section 2 reviews related work on asynchronous sub-threshold design and asynchronous methodology. Section 3 presents the proposed Snake pipeline, its RTC, and its performance. Section 4 introduces the asynchronous design methodology. Section 5 presents the experimental results and analysis. Section 6 concludes the paper.
Related work

Asynchronous sub-threshold design
In [10], Chang employs a piecewise parallel delay line as a critical path replica (CPR) to alleviate the variation-induced mismatch between the CPR and the logic delay in a bundled-data asynchronous pipeline. To further decrease the performance and energy penalty caused by the return-to-zero time in four-phase handshaking, he replaces the CPR buffers with AND-gate-type buffers skewed toward pulling down. In [11], Liu adapts the MOUSETRAP asynchronous pipeline [14] to sub-threshold operation. He adds an extra delay to the control paths that make the latches opaque, thus widening the capture window and improving the ability to capture data delayed by random PVT variation. In [8], Hand introduces an asynchronous template called Blade, which incorporates the error detection concept from synchronous resilient techniques. A variation-induced violation is signaled to the preceding and succeeding stages to trigger the recovery mechanism between neighboring stages. Owing to the meticulous handshaking, the metastability and architectural modification of the synchronous counterparts are avoided.
Asynchronous methodology
In general, asynchronous design methodologies can be classified into two groups [15]. The first designs dedicated languages that best capture the fine-grained concurrency and distributed synchronization of asynchronous pipelines, along with the corresponding synthesis and optimization tools [17, 18, 19]. Although this approach is more ideal in principle, it requires redesign in a new language and conflicts with existing RTL designs. Moreover, the readily available synthesis tools work at a higher level, leaving designers little opportunity for optimization. The second group provides compatibility with existing synchronous languages and EDA tools. The works in [20, 21, 22] redesign asynchronous standard cell (ASTC) libraries in the NCL [22] style. After synchronous synthesis, they translate the netlists to ASTC. In [23], Cortadella uses a bundled-data template for the translation, which allows the use of commercially optimized standard cells (STC). However, the RTC is not optimized during synthesis and placement, so it may not be met.
Asynchronous template
This section first introduces the asynchronous control unit and a simple FIFO to illustrate the operation of Snake. We then present two primitives, JOIN and FORK, which extend its application to complicated nonlinear pipelines. Finally, we elaborate on the timing constraints and performance of Snake.
Asynchronous FIFO
In a classical bundled-data control unit [12], the request signal indicates the arrival of data, while the acknowledgment signal indicates that the data has been received successfully. Bundled data imposes a one-sided timing constraint: data must arrive before the request. Therefore, a delay line that propagates the request is required to match the combinational logic of the data path, and this delay line is the main overhead of asynchronous templates.

Snake is different. Fig. 1 shows the structure of the asynchronous control unit (ACU) in Snake. To reduce the overhead, the request is acknowledged halfway. With a careful design, the first half and the second half of the delay line that propagates the request can be shared, forming a looped request delay line (LRDL). The LRDL halves the design overhead while preserving performance and variation tolerance. The request signal propagates through the LRDL like a snake, which is the origin of the pipeline's name.

It is essential to understand the function of each gate before analyzing the behavior of the FIFO. The register with an inverter on its data path converts a transition of a specific direction at the clock pin into a transition of arbitrary direction at the output pin. Specifically, in Fig. 1, a positive transition of req_N triggers a transition of ack_N in the upper register, while a negative transition of req_N triggers a transition of done_N in the lower register. The NOR gate blocks the next request until the current request has been handled. The XOR gate serves as an OR function for transitions at its inputs: any transition of done_N or ack_(N+1) produces a transition of req_(N+1).
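The gate roles described above can be sketched behaviourally. The Python model below is only an illustration under stated assumptions, not the circuit of Fig. 1: `ToggleRegister`, `xor_merge`, and `nor_block` are hypothetical names, all delays are ignored, and the register is modelled as an edge-triggered flip-flop whose data input is its own inverted output.

```python
class ToggleRegister:
    """Register with an inverter on its data path (behavioural assumption).

    Every active clock edge flips the output, so an edge of one fixed
    direction at the clock pin becomes a transition of either direction at Q.
    """

    def __init__(self, active_edge="rising"):
        self.active_edge = active_edge
        self.q = 0
        self._last_clk = 0

    def clock(self, clk):
        rising = self._last_clk == 0 and clk == 1
        falling = self._last_clk == 1 and clk == 0
        if (self.active_edge == "rising" and rising) or (
            self.active_edge == "falling" and falling
        ):
            self.q ^= 1  # D = not Q, so the output toggles
        self._last_clk = clk
        return self.q


def xor_merge(a, b):
    """XOR as an OR for transitions: a change on either input changes the output."""
    return a ^ b


def nor_block(a, b):
    """NOR as a blocker: the output is 1 only while both inputs are 0."""
    return int(not (a or b))


# req_N drives both registers: the upper one reacts to rising edges (ack_N),
# the lower one to falling edges (done_N).
upper = ToggleRegister("rising")
lower = ToggleRegister("falling")
trace = [(req, upper.clock(req), lower.clock(req)) for req in (1, 0, 1, 0)]
# trace == [(1, 1, 0), (0, 1, 1), (1, 0, 1), (0, 0, 0)]
```

Each tuple is (req_N, ack_N, done_N): ack_N toggles only on rising edges of req_N and done_N only on falling edges, which is the direction-to-transition conversion the text describes.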
Consider the two-stage pipelined FIFO in Fig. 2. Each ACU generates the local clock for its multi-bit-wide data registers. The handshaking timing is shown in Fig. 3. Initially, all req and ack signals are reset to 0. ACU_N sends a request to ACU_(N+1) by setting req_NL. Meanwhile, the positive transition of req_NL triggers the data registers to send new data. After passing through LRDL_N, the positive transition of req_NR is sensed by the upper register in ACU_(N+1), and an ack_N transition is sent back to ACU_N, resetting req_NL. After passing through LRDL_N again, the negative transition of req_NR is captured by the lower register in ACU_(N+1), setting req_(N+1)L. The positive transition of req_(N+1)L then triggers the registers in stage N+1 to accept the new data, which marks the end of handshaking in the current stage and the start of handshaking in the next stage.

Two observations may aid comprehension:
1) The moment req_NL is set, the request is sent to the next stage along with the data. Meanwhile, the NOR gate in ACU_N is gated by req_NL to block any new request to ACU_N, so that the previous request and data are not flushed.
2) req_N behaves like a 4-phase signal while ack_N behaves like a 2-phase signal. In one complete handshake, req_N is set and returns to zero, while ack_N inverts only once. In other words, req_N is state-signaling while ack_N is transition-signaling.

The advantages of the asynchronous pipeline are as follows:
1) It uses registers, rather than the latches of typical asynchronous templates, to store the states of control and data signals. On the one hand, this makes the design glitch-free and robust even under sub-threshold operation. On the other hand, it is free from the timing loops that latch-based templates suffer from, which simplifies the RTC.
2) The LRDL multiplexes the first-half and second-half request delay lines, thus halving the design overhead and making the pipeline better suited to ultra-low-power applications.
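The handshake sequence described above can be replayed as an ordered event list. The sketch below is a simplified illustration, not a gate-level model: the function name `handshake` and the `lrdl_delay` value are assumptions, every delay other than the LRDL is taken as zero, and the return-to-zero of req_(N+1)L (which belongs to the next stage's own handshake) is not modelled.

```python
def handshake(n_items, lrdl_delay=5):
    """Replay the stage-N handshake for n_items data items.

    Returns a list of (time, signal, new_value) events.
    """
    ack_n = 0
    t = 0
    events = []
    for _ in range(n_items):
        # ACU_N sets req_NL; the stage-N data registers launch new data.
        events.append((t, "req_NL", 1))
        # First pass through LRDL_N: the rising edge appears on req_NR.
        t += lrdl_delay
        events.append((t, "req_NR", 1))
        # The upper register in ACU_(N+1) toggles ack_N, which resets req_NL.
        ack_n ^= 1
        events.append((t, "ack_N", ack_n))
        events.append((t, "req_NL", 0))
        # Second pass through the same LRDL_N: the falling edge appears on req_NR.
        t += lrdl_delay
        events.append((t, "req_NR", 0))
        # The lower register in ACU_(N+1) fires; req_(N+1)L rises and
        # stage N+1 captures the data.
        events.append((t, "req_(N+1)L", 1))
    return events


events = handshake(n_items=2, lrdl_delay=5)
req_edges = [e for e in events if e[1] == "req_NL"]
ack_edges = [e for e in events if e[1] == "ack_N"]
# Per data item, req_NL makes two transitions (set, reset) while ack_N
# makes one, matching the 4-phase request and 2-phase acknowledge.
```

With two items the trace contains four req_NL transitions and two ack_N transitions, and each item reaches stage N+1 after two passes through the same delay line.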
Nonlinear pipeline: JOIN and FORK
In real systems, pipelines are usually nonlinear, which means that a stage may be requested to receive data from multiple stages, or may request to send data to multiple stages. To extend Snake to nonlinear pipelines, we provide two nonlinear interface elements, JOIN and FORK, as shown in Fig. 4. With these simple interfaces, a nonlinear stage can communicate reliably with multiple predecessor and successor stages.
Timing constraint and performance
As mentioned above, a one-sided RTC must be met: the data must arrive earlier than the request by at least the setup time of the register. This poses a constraint on the length of the LRDL:

$$2\,t_{LRDL} + t_{JOIN} + t_{FORK} \geq t_{data} + t_{setup} \qquad (1)$$

where $t_{LRDL}$ is the delay of one pass through the LRDL, $t_{JOIN}$ and $t_{FORK}$ are the total gate delays of the JOIN and FORK elements, respectively, $t_{data}$ is the propagation delay of the data, and $t_{setup}$ is the setup time of the register. By introducing a timing margin $t_{margin} \geq 0$, Eq. (1) can be expressed as:

$$2\,t_{LRDL} + t_{JOIN} + t_{FORK} = t_{data} + t_{setup} + t_{margin} \qquad (2)$$

The NOR gate in the ACU blocks the next request (NR) while the current request (CR) has not been acknowledged, which poses a hold constraint for the pipeline: CR must arrive at the NOR gate earlier than NR,

$$t_{CR} < t_{NR} \qquad (3)$$

where $t_{CR}$ and $t_{NR}$ are the arrival times of CR and NR at the NOR gate. In Fig. 2, when CR propagates to req_NL after a transition of ack_N, it produces a negative transition that releases the NR blocked by the NOR gate in ACU_N. Thereafter, CR propagates through stage N while NR propagates through both stage N-1 and stage N before they arrive at the inputs of the NOR gate in ACU_(N+1). Denoting the request propagation delays of stages N-1 and N by $t_{N-1}$ and $t_{N}$, the constraint can be expressed as:

$$t_{N} < t_{N-1} + t_{N} \qquad (4)$$

$$t_{N-1} > 0 \qquad (5)$$

Therefore, the hold constraint Eq. (3) is satisfied inherently in the pipeline.

Glitch-free operation is guaranteed by the constraints above together with the proposed pipeline timing. With the bundled-data scheme, sequential hazards in the data paths are eliminated [14, 26] as long as the setup constraint is satisfied. By restricting the transitions at the inputs of multi-input combinational gates, combinational hazards in the control paths are avoided [27, 28]. Specifically, in the JOIN element, req0_N and req1_N only transition in the same direction for a given stage request, and the same holds for the FORK element. In the NOR and XOR gates of the ACU, only one input transitions at a time, owing to the hold constraint and the protocol timing.

We use cycle time [28] to describe the performance of the asynchronous pipeline. Cycle time is the time interval between successive data items when the pipeline operates at maximum speed. A cycle for stage N starts at a positive transition of req_N and ends at the subsequent positive transition of req_(N+1). Therefore, the cycle time of the stage is

$$T_{cycle} = 2\,t_{LRDL} + t_{JOIN} + t_{FORK} \qquad (6)$$

$$T_{cycle} = t_{data} + t_{setup} + t_{margin} \qquad (7)$$

Eq. (7) indicates that the cycle time of a stage depends only on the logic of that stage and is independent of the other stages. This performance is superior to that of a synchronous design, in which the cycle time of every stage is the global clock period, determined by the longest stage.
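The effect of the setup constraint on LRDL sizing and on per-stage cycle time can be illustrated numerically. The sketch below rests on assumptions that should be checked against the original equations: the request is taken to cross the LRDL twice plus the JOIN and FORK gates, and it must trail the data by the register setup time plus a margin. The function names, the symbol names, and all delay values (in arbitrary time units) are hypothetical.

```python
def lrdl_delay(t_data, t_setup, t_join, t_fork, t_margin=0.0):
    """Single-pass LRDL delay that meets the assumed setup constraint.

    Assumed constraint: 2*t_lrdl + t_join + t_fork = t_data + t_setup + t_margin.
    """
    needed = t_data + t_setup + t_margin - t_join - t_fork
    return max(0.0, needed / 2.0)  # the LRDL cannot have negative delay


def stage_cycle_time(t_lrdl, t_join, t_fork):
    """Cycle time of one stage: two LRDL passes plus the JOIN/FORK gates."""
    return 2.0 * t_lrdl + t_join + t_fork


def compare(stage_data_delays, t_setup, t_join, t_fork, t_margin):
    """Per-stage asynchronous cycle times versus one global synchronous period."""
    async_cycles = []
    for t_data in stage_data_delays:
        t_lrdl = lrdl_delay(t_data, t_setup, t_join, t_fork, t_margin)
        async_cycles.append(stage_cycle_time(t_lrdl, t_join, t_fork))
    # Simplifying assumption: the synchronous clock period is set by the
    # longest stage with the same setup time and margin.
    sync_cycle = max(stage_data_delays) + t_setup + t_margin
    return async_cycles, sync_cycle


# Three stages with different logic depths (made-up numbers).
async_cycles, sync_cycle = compare(
    [4.0, 9.0, 6.0], t_setup=0.2, t_join=0.3, t_fork=0.3, t_margin=0.5
)
```

In this example only the slowest stage runs at the synchronous period; the other two stages complete their cycles earlier, which is the stage-local behaviour the cycle-time discussion points to.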
Design methodology
The design flow is automated with Synopsys EDA tools through a set of in-house TCL scripts. The scripts combine EDA built-in commands for circuit analysis with ordinary programming constructs for information processing, such as set, if, and foreach. The flow is shown in Fig. 5. It starts from a synthesizable RTL design. After the design is synthesized to a synchronous netlist, the netlist is imported into PrimeTime (PT) for topology analysis. The required topology information comprises the stage critical paths and the stage connections. A stage is the abstract terminal of a multi-bit-wide data path, represented in the circuit by multi-bit-wide registers. The registers in the same stage are identified by assigning them a stage name with the built-in command set_user_attribute. Stage critical paths are acquired by iterating over every pipeline between two stages with the built-in command get_timing_paths. By further analyzing the startpoints and endpoints of the stage critical paths, we obtain the stage connection information, which is stored in arrays, a TCL data structure. For each stage, one array stores its predecessor stages and another stores its successor stages.

The netlist is then imported into Verdi for ECO. By iterating over every stage, the clock in the netlist is replaced with distributed asynchronous templates according to the stage connection information acquired in the topology analysis. In detail, the clock net is removed with the built-in command _ecoDeleteConnection, and the ACU, JOIN, and FORK elements are inserted with the built-in command _ecoAddInst. The resulting netlist does not satisfy Eq. (1) in the absence of the LRDL. Therefore, we insert the LRDL into the pipeline and optimize the timing in Design Compiler (DC) with the built-in command set_min_delay. The length of each LRDL is determined through Eq. (2), using the timing information of the stage critical paths.

In the back-end, the clock tree synthesis of the regular flow is replaced with asynchronous timing optimization, where the timing constraint is the same as in the front-end. All other steps are the same as in the regular back-end flow.
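The topology-analysis step can be illustrated outside the EDA tools. The Python sketch below is not the TCL flow itself: it assumes the stage critical paths have already been reduced to (startpoint stage, endpoint stage) pairs, and the function names and stage labels are hypothetical. It also assumes that every stage receives one ACU, that a JOIN is needed when a stage has more than one predecessor, and that a FORK is needed when it has more than one successor.

```python
from collections import defaultdict


def stage_connections(paths):
    """Build predecessor and successor maps from (start_stage, end_stage) pairs."""
    preds = defaultdict(set)
    succs = defaultdict(set)
    for start, end in paths:
        succs[start].add(end)
        preds[end].add(start)
    return preds, succs


def plan_templates(paths):
    """Decide which asynchronous elements each stage would need (assumed rule)."""
    preds, succs = stage_connections(paths)
    stages = sorted(set(preds) | set(succs))
    plan = {}
    for stage in stages:
        cells = ["ACU"]  # one control unit per stage
        if len(preds[stage]) > 1:
            cells.append("JOIN")  # several predecessors request this stage
        if len(succs[stage]) > 1:
            cells.append("FORK")  # this stage requests several successors
        plan[stage] = cells
    return plan


# Hypothetical pipeline: A and B feed C, and C feeds D and E.
paths = [("A", "C"), ("B", "C"), ("C", "D"), ("C", "E")]
plan = plan_templates(paths)
# plan["C"] == ["ACU", "JOIN", "FORK"]; every other stage gets ["ACU"] only.
```

Keeping two maps per stage mirrors the two TCL arrays described in the text, one for predecessors and one for successors.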
As can be seen, the proposed design methodology has the following merits:
1) It is compatible with existing RTL designs. This is crucial because commercial designs are often based on IP that designers are not allowed to modify.
2) It is compatible with commercial EDA tools. The added steps are automated in Synopsys EDA tools using built-in commands.
3) RTC optimization is simple because the RTC is derived from the register-based template.
Experiment results
We use a 16x16 pipelined multiplier as a benchmark to evaluate the proposed techniques. The initial RTL is synthesized at 200 MHz, 1.08 V, the SS corner, and 125℃, targeting the SMIC 55 nm process. The asynchronous design is then implemented as a layout through the proposed flow. For comparison, a synchronous baseline design is implemented under the same conditions through the traditional design flow. The layouts are then extracted with parasitic RC for accurate SPICE simulation. We run 100 Monte-Carlo iterations for each supply voltage. The results show that the circuit operates robustly at 0.2 V, close to the minimum functional voltage of the standard cells (STC). Fig. 6 shows the timing diagram of a multiply operation at 0.3 V, the SS corner, and 0℃. Fig. 7 shows the distribution of energy and cycle time over 100 Monte-Carlo iterations at 0.3 V.

We further investigate the energy consumption and performance by operating the pipeline at maximum throughput, fed with random operands. The performance is evaluated by cycle time, while the energy is evaluated by the energy consumption per cycle or the average power consumption over a cycle. Fig. 8 shows the relation of energy and cycle time to supply voltage. As can be seen, the minimum power point is 0.2 V with 1.3 nW, while the minimum energy point is 0.3 V with 1.07 pJ. Compared to the synchronous baseline design, the asynchronous design occupies 22% more area; this overhead is expected to be smaller if the data width is larger or the clock distribution in the original design is more complicated. As for energy, the proposed technique reduces the energy, which is 8.24 pJ per cycle in the baseline, by 6.7 times. The energy efficiency comes from three sources: 1) the tolerance of variation at sub-threshold supply voltages; 2) the elimination of unnecessary transitions in the clock paths; 3) the resilient cycle time of each stage.
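The reported energy figures can be cross-checked with simple arithmetic. The sketch below uses only the two energies quoted above (8.24 pJ per cycle for the baseline and 1.07 pJ per cycle for Snake); the variable names are arbitrary, and reading the 6.7 figure as the saving measured relative to the Snake energy is an interpretation rather than something the text states.

```python
baseline_pj = 8.24  # synchronous baseline, energy per cycle (pJ)
snake_pj = 1.07     # Snake pipeline, minimum energy per cycle (pJ)

# Plain ratio of the two energies.
ratio = baseline_pj / snake_pj

# Energy saved, expressed as a multiple of the Snake energy.
relative_saving = (baseline_pj - snake_pj) / snake_pj

print(f"baseline / Snake            = {ratio:.2f}")
print(f"(baseline - Snake) / Snake  = {relative_saving:.2f}")
```

Under this reading the quoted 6.7 times corresponds to the second quantity, while the plain ratio of the two energies is about 7.7.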
The comparison of our work with other state-of-the-art works is summarized in Table I. It indicates that the proposed techniques are superior to the other state-of-the-art works in sub-threshold ultra-low-power applications.
Conclusion
In this paper, we propose a novel asynchronous pipeline, called Snake, for ultra-low-power applications. The inherent resilience of the asynchronous template makes it variation-tolerant in the sub-threshold region. Being based on registers rather than latches, the template simplifies the timing constraints and makes the design more robust in sub-threshold operation. By acknowledging the request halfway and looping the delay line, the pipeline halves the design overhead. In addition, we propose a practical asynchronous design methodology that requires only a few modifications to the synchronous flow. It is compatible with existing RTL designs and commercial EDA tools, and the RTC optimization in the flow is simple. Monte-Carlo SPICE simulation shows that a multiplier applying the proposed techniques works robustly at 0.2 V. At maximum throughput, it achieves a minimum power of 1.3 nW at 0.2 V and a minimum energy of 1.07 pJ per cycle at 0.3 V. Compared to the synchronous baseline design, it decreases energy by 6.7 times with 22% area overhead. It is also competitive with other state-of-the-art works.
