Abstract-This paper presents a methodology and chip demonstration for designing near-/sub-threshold voltage (V t ) pipelines using pulsed latches that are clocked with very wide pulses. Pulsed-latch-based design is known for its time borrowing capability, but the amount of time borrowing is limited by the hold time constraint. To enable more cycle borrowing, we pad short paths to ∼1/3 of the cycle time using a multi-V t cell library. While delay padding using multi-V t cells is common in super-V t design, the small delay difference among multi-V t cells has not allowed such extensive short path padding because of the large area overhead. In the near-/sub-V t regime, however, circuit delay becomes exponentially sensitive to V t , suggesting that high-V t cells can significantly reduce the overhead of padding. We build a semi-automatic short path padding flow around this idea, and use it to design: 1) ISCAS benchmark circuits and 2) an 8-bit 8-tap finite impulse response (FIR) core, the latter fabricated in a 65-nm CMOS technology. The chip measurement shows that the proposed FIR core achieves 45.2% throughput (frequency), 11% energy efficiency (Energy/cycle), and 38% energy-delay-product improvements at 0.35 V over the flip-flop-pipelined baseline. The measurement results also confirm that the proposed FIR core operates robustly with the same pulsewidth setting across process, voltage, and temperature variations.
I. INTRODUCTION
NEAR-/sub-threshold voltage (V t ) circuit design is one of the promising approaches for increasing energy efficiency in digital systems. The key challenges in this approach are large performance degradation and extreme variability. Among several recent efforts to mitigate those challenges, the aggressive two-phase latch-based sequencing was used for time borrowing, which can improve performance and tolerance of the variability [1] , [2] . However, two-phase sequencing has an inherently larger sequential overhead than flop-based design [3] . Additionally, the logic depth per latch stage is reduced by half after re-timing, which can exacerbate the impact of local variations.
To overcome the shortcomings above, the pulsed-latch can be an attractive option for designing pipelines in near-/sub-V t digital circuits because of its lower sequential overhead and larger variation tolerance than edge-triggered flip-flops (FFs) [4] - [7] . Several previous works [8] - [16] , [33] - [37] have studied pulsed-latch-based circuit design. Compared with the two-phase latch-based design [30] - [32] , the pulsed-latch-based design consumes less sequential logic area (ignoring the pulse generation and distribution overhead). Also, there is no need to perform re-timing during pipeline design: we can simply replace all the FFs with pulsed-latches to migrate from an FF-based circuit to a pulsed-latch-based circuit. This avoids the halving of logic depth per latch stage that re-timing causes in a two-phase latch-based pipeline, which can exacerbate the impact of local variations. Furthermore, the pulsed-latch-based design retains the time borrowing ability.
However, the hold time constraint limits the pulsewidth. For example, one of the existing studies used narrow pulses whose width is <∼5 Fan-Out-of-4 (FO4) delays [5] . As a result, the time borrowing ability and variation tolerance of pulsed-latch-based pipelines are limited. The small amount of time borrowing allows only roughly a 10% decrease in the clock period [5] , which is smaller than what a two-phase latch pipeline can support [17] (half the cycle time [T C ]). To enable more time borrowing, pulsewidth allocation combined with clock skew scheduling or re-timing was proposed in previous works [8] - [16] . These techniques can decrease the clock period by an additional 20% [5] . However, the large worst case delay variability of near-/sub-V t circuits makes it challenging to apply these techniques.
Moreover, a narrow pulsewidth complicates clock network design. It is challenging to distribute narrow pulses because the slew can easily degrade as a pulse travels through the network. To guarantee the integrity of the narrow pulse shape, the design of pulsers (pulse generators) and the placement of pulsed-latches must be performed carefully [5] . Because of wire parasitics, the chip requires strong buffers in its pulse distribution network and local pulse generators [5] , [18] , incurring large area and power overhead.
In this paper, we propose a methodology to design pulsed-latch-based pipelines for near-/sub-V t operation with more cycle borrowing yet low overhead [21] . Specifically, we enable the use of wider pulses in pipelines by padding short paths using multi-V t cells. Padding with multi-V t cells is not a new practice in super-V t pipeline design. However, as the delay difference between regular-V t and high-V t cells is not significant in super-V t digital circuits, it is unusual to pad short paths by more than a small fraction of the cycle time, e.g., 5 FO4 delays. In this paper, we take advantage of the large delay difference between multiple V t cells in near-/sub-V t circuits, and pad short paths to ∼1/3 of the clock cycle time. This enables a proportionally larger amount of cycle borrowing per stage, thereby improving cycle time by ∼33%. Moreover, the reduction of cycle time also saves active-leakage energy, since this energy equals the product of leakage power and cycle time when circuits are in the active mode. The overhead of the proposed multi-V t padding is 2.7 times less than that of the regular-V t padding.
We also devise a semi-automatic short path padding flow using commercial-grade logic-synthesis and automatic-placement-and-route (APR) tools. We then apply the flow to implement the ISCAS benchmark circuits. The results show that our methodology reduces the average area overhead of short path padding by 2.71 times as compared with the padding using single-V t cells. Furthermore, the wide pulsewidth simplifies clock network design, which could require four times more buffers if narrow pulses had to be used.
Based on the proposed techniques, we implement an 8-bit finite impulse response (FIR) core in a 65-nm general-purpose CMOS process. The measurement shows that the core operates robustly with 12 FO4 delay wide pulses (∼1/3 of T C of the core) across process, voltage, and temperature (PVT) variations. With such large cycle-borrowing capability, the core also achieves 45.2% better throughput, 11% less Energy/cycle, and 38% smaller energy-delay-product (EDP) at 0.35 V, as compared with the edge-triggered FF design. The area overhead is 15%.
This paper is organized as follows. In Section II, we describe our proposed techniques and the design of the pulsed-latch pipelined 8-bit FIR core, including the design principle, short path padding method, clock network distribution, and FIR core design. Section III presents the measurement results of the test chips, including the measurements and comparisons across PVT variations. Finally, the conclusions are drawn in Section IV.
II. WIDE-PULSED-LATCH-BASED PIPELINES
A. Design Principle
Fig. 1 shows the concept of the classical pulsed-latch-based pipeline. Conventional standard latches are used in the pipeline, and all the latches in a pipeline are driven by a single-phase clock (clkp), whose high-phase time is defined as T pw . In a typical logic path, the sum of the D-to-Q delay of a latch (T dq ) and the logic delay (T logic ) should be smaller than T C . If a path has a delay longer than T C , however, the path may borrow time from the next stage. The maximum time that can be borrowed from the next stage (T borrow ) is bounded by T pw minus the setup time of a latch (T setup ). This implies that a wider pulsewidth gives more time to borrow. However, the maximum T pw is also constrained by the delay of the shortest path in a pipeline, because any path with a delay less than T pw (named a short path) will cause a hold time violation. This constraint can be expressed as follows:
\[ T_{pw} < T_{\mathrm{short}} \tag{1} \]
where T short is the delay of the shortest path in the pipeline.
Although it is desirable to pad short paths further (i.e., lengthen their delay by adding delay cells) to enable more time borrowing, excessive short path padding causes large area overhead. Our experiment using an 8-bit multiplier shows that padding short paths to 12 FO4 delays, which is about 1/3 T C of the multiplier, can cause 51.7% area overhead. Because of this severe overhead, narrow pulses (<5-6 FO4 delays [5] ) are typically used in conventional pulsed-latch pipelines [5] , [27] , [28] . Yet the performance improvement from cycle borrowing is proportional to T pw . The cycle time of a pulsed-latch pipeline, T C,pulse , can be derived as
\[ T_{C,pulse} = T_{C,ff} - T_{pw} \tag{2} \]
where the cycle time of an FF pipelined baseline is T C,ff . Hence, the maximum frequency improvement (Max. FI) over the baseline can be formulated as
\[ \text{Max. FI} = \frac{F_{pulse} - F_{base}}{F_{base}} = \frac{T_{pw}}{T_{C,ff} - T_{pw}} \tag{3} \]
where F base is the frequency of the FF pipelined baseline, and F pulse is the frequency of the pulsed-latch pipeline. Fig. 2 shows the maximum FI curve across different T pw values when T C,ff is 40 FO4 delays. For a design with T pw = 1/10 T C,ff , the maximum FI can be 11%, while for a design with T pw = 1/3 T C,ff , the maximum FI can be up to 50%. This promises a significant performance improvement from using wide pulses in pulsed-latch-based pipelines.
The above frequency improvement analysis considers only one path; if one path borrows time from the next stage, the next stage faces a stricter constraint. In practice, however, it is very rare that an input exercises only critical paths across all the pipeline stages [29] , so the pulsed-latch pipeline can realize the frequency improvement across the entire pipeline.
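The two operating points quoted above can be checked with a short calculation. The following is a minimal sketch that assumes the pulsed-latch cycle time is roughly T C,ff minus T pw (an idealization that ignores latch setup time); the function name is illustrative and not from the paper.

```python
def max_frequency_improvement(t_pw: float, t_c_ff: float) -> float:
    """Upper bound on the frequency gain of a pulsed-latch pipeline.

    Assumes T_C,pulse = T_C,ff - T_pw (idealized, setup time ignored).
    Both arguments use the same unit, e.g., FO4 delays.
    """
    if not 0 <= t_pw < t_c_ff:
        raise ValueError("t_pw must satisfy 0 <= t_pw < t_c_ff")
    t_c_pulse = t_c_ff - t_pw
    # F = 1/T, so (F_pulse - F_base) / F_base = T_C,ff / T_C,pulse - 1
    return t_c_ff / t_c_pulse - 1.0


T_C_FF = 40.0  # FO4 delays, as in the Fig. 2 discussion
for fraction in (1 / 10, 1 / 3):
    gain = max_frequency_improvement(fraction * T_C_FF, T_C_FF)
    print(f"T_pw = {fraction:.2f} * T_C,ff -> max FI = {gain:.1%}")
```

Under this assumption, a pulsewidth of one tenth of the cycle gives about 11% and one third of the cycle gives 50%, matching the figures in the text.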
B. Proposed Short Path Padding Technique With Multi-V t Cells
To reduce the overhead of short path padding, we exploit the exponential relationship between V t and delay in near-/sub-V t circuits. We simulated the delay and power of three typical logic cells, i.e., a full adder, an AND gate, and an inverter, with different V t values. As shown in Fig. 3 , high-V t cells have significantly longer delay and smaller power dissipation than regular-V t and low-V t cells. This implies that we can use fewer high-V t cells than regular-V t and low-V t cells for the same amount of short path padding, saving both area and power. While multi-V t design is well known for super-V t circuits, we find that the use of the multi-V t library for short-path padding is particularly effective for near- and sub-V t circuits, since high-V t cells are exponentially slower than regular- and low-V t cells. With the multi-V t library, we can have a large amount of padding at low area and power overhead, allowing the use of wider pulses. This also alleviates the overhead associated with pulse generation and distribution.
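The cell-count argument above can be illustrated with a toy calculation. This is only a sketch: the path delays and per-cell delays below are hypothetical values chosen for illustration, not data from the paper, and the model ignores cell area differences and any effect of padding on critical paths.

```python
import math


def cells_needed(path_delays, target, cell_delay):
    """Count delay cells required to bring every short path up to `target`.

    All delays share one unit (e.g., FO4). Paths already at or above the
    target need no padding.
    """
    total = 0
    for delay in path_delays:
        shortfall = target - delay
        if shortfall > 0:
            total += math.ceil(shortfall / cell_delay)
    return total


# Hypothetical numbers for illustration only (not taken from the paper).
SHORT_PATHS_FO4 = [3, 5, 8, 11, 14]
TARGET_FO4 = 12
REGULAR_VT_CELL_FO4 = 1.0  # assumed delay added by one regular-Vt delay cell
HIGH_VT_CELL_FO4 = 4.0     # assumed delay added by one high-Vt delay cell

regular = cells_needed(SHORT_PATHS_FO4, TARGET_FO4, REGULAR_VT_CELL_FO4)
high = cells_needed(SHORT_PATHS_FO4, TARGET_FO4, HIGH_VT_CELL_FO4)
print(f"regular-Vt cells: {regular}, high-Vt cells: {high}")
```

With these assumed values, the slower high-V t cell reaches the same padding target with a third of the cells, which is the mechanism behind the lower area and power overhead.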
We performed an APR experiment on an 8-bit multiplier. In the experiment, we padded all short paths of the multiplier for different T pw values and analyzed the area and power overhead. As shown in Fig. 4 , at a T pw of 12 FO4 delays, the padding area and power overhead are 51.7% and 91.2% with regular-V t cells, but only 19.3% and 15.2% with multi-V t cells. This marks a 2.7 times area and 6 times power reduction. Fig. 5 shows the path delay distribution before and after the short path padding. Before padding, the shortest path delay is 3 FO4 delays; after padding, the shortest path delay is extended to 14 FO4 delays (about 1/3 of the cycle time) without compromising the critical path delay.
C. Design Flow for Short Path Padding
The overhead of padding is sensitive to circuit structure. To verify the effectiveness of our proposed techniques across a range of circuits, we devised a semi-automatic flow to perform the multi-V t -based short path padding in near-/sub-V t circuits. Fig. 6 shows the flowchart of the proposed flow. It starts with a logic synthesis tool (Synopsys Design Compiler) with constraints on both critical paths and short paths. The synthesis tool uses a multi-V t library, which we characterize at V DD = 0.35 V using Cadence Encounter Library Characterization. Then, we perform another iteration of short path padding in the APR phase using Cadence Encounter. This iteration is critical for capturing the impact of parasitics on timing. Next, we perform static-timing analysis to determine whether the timing constraints are met. The short-path and long-path information is generated using Synopsys PrimeTime. If the timing constraints are not met, we iterate the above process until they are met.
With this flow, we performed padding on nine circuits from ISCAS'85 and 11 circuits from ISCAS'89. The 20 circuits in total contain both combinational and sequential logic. For each benchmark, we set the target T pw to 1/3 T C of the original circuit (org). Tables I and II summarize the area and power overhead and other metrics. Our proposed multi-V t padding achieves 14.21% (area) and 9.87% (power) average overhead across the 20 benchmark circuits, while padding using single-V t cells exhibits 30.98% (area) and 42.4% (power) average overhead. Consequently, the average area and power overhead reductions from single-V t padding to multi-V t padding are 2.71 times and 4.3 times, respectively.
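The overhead reductions quoted for the 8-bit multiplier and for the benchmark averages can be recomputed directly from the stated percentages. The sketch below uses only numbers given in the text; the helper name is illustrative. Note that the ratio of the two average area overheads comes out near 2.18, so the quoted 2.71 times is presumably obtained differently, for example by averaging per-circuit ratios from Tables I and II (this is an assumption; the tables are not reproduced here).

```python
def reduction_factor(single_vt_overhead: float, multi_vt_overhead: float) -> float:
    """Ratio of single-Vt padding overhead to multi-Vt padding overhead."""
    return single_vt_overhead / multi_vt_overhead


# 8-bit multiplier at T_pw = 12 FO4 delays (percent overhead, from the text)
mult_area = reduction_factor(51.7, 19.3)
mult_power = reduction_factor(91.2, 15.2)

# Averages over the 20 ISCAS benchmarks (percent overhead, from the text)
bench_area = reduction_factor(30.98, 14.21)
bench_power = reduction_factor(42.4, 9.87)

print(f"multiplier: area {mult_area:.2f}x, power {mult_power:.2f}x")
print(f"benchmarks (ratio of averages): area {bench_area:.2f}x, power {bench_power:.2f}x")
```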
In summary, the use of a multi-V t library for padding enables a significant reduction in padding overhead. This large degree of improvement is unique to near-/sub-V t circuits, since high-V t cells become much slower relative to regular- and low-V t cells only in near-/sub-V t circuits.
D. Wide-Pulsed-Latch-Based FIR Prototype
We designed an 8-bit 8-tap data-broadcast architecture FIR filter based on pulsed-latch pipelines (Fig. 7) . The dashed lines show the location of pulsed-latches. The multipliers, based on the Baugh-Wooley architecture, were padded with high-, low-, and regular-V t logic and delay cells (Fig. 8) . The critical path delay is not changed. Each multiplier has two pipeline stages, employing a total of 34 pulsed-latches. We also designed ripple carry adders using the same padding technique. T C of the FIR core is 40 FO4 delays and the target T pw is set to 1/3 T C . After the delay padding, the shortest path delay is ∼14 FO4 delays.
We also designed the pulse generator and pulse distribution network. Fig. 9 shows the distribution of the clock network and the schematics of the shared pulse generator. We implemented only one pulse generator for all the latches (∼400) in the FIR core. Also, we designed a 1-level pulse distribution tree based on the merged buffer scheme [19] , [20] for low skew and low power. The configurable pulse generator can supply a pulsewidth from 5 FO4 delays to 20 FO4 delays for the FIR chip.
Such a simple clock design is feasible because the wide pulse alleviates the slew requirement in pulse distribution. We set the high-phase time of the pulse to no less than 5 FO4 delays to meet the latch setup time and allow some amount of time borrowing. If pulses need to be narrow, the slew time must be short enough to ensure that the pulses stay high for a sufficient time. For example, for a pulsewidth <7 FO4 delays, we need to reduce the slew time down to ∼1 FO4 delay. This requires either a very strong buffer to compensate for wire resistance or embedded local pulse generators, each of which is shared by only nearby pulsed-latches. In our design, the wide pulsewidth (∼12 FO4 delays) relaxes the slew constraint, e.g., to 3 FO4 delays, and simplifies the clock design. (Excessive relaxation of slew can degrade T dq and T setup of latches.)
We quantitatively analyze the required buffer size to distribute pulses of different widths at 0.35 V using the post-layout clock network. As shown in Fig. 10 , it is difficult to reliably distribute pulses whose T pw is <6 FO4 delays. This is because the wire parasitics are too large to achieve the required slew time (0.5 FO4 delay assuming 5 FO4 delays of high-phase duration). We need 72×, 36×, and 18× buffers to distribute pulses whose T pw is 6, 7, and 8 FO4 delays, respectively. For T pw > 8 FO4 delays, however, we can keep using the 18× buffer strength to meet the slew constraint (<3 FO4 delays). The use of narrow pulses could thus require up to four times more buffering in the distribution network.
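The buffer sizing results above can be summarized as a simple lookup. This is a minimal sketch: the function name is illustrative, and treating the reported points (6, 7, and 8 FO4 delays) as a step function for in-between pulsewidths is an assumption, since the text reports only those points.

```python
from typing import Optional


def required_buffer_strength(t_pw_fo4: float) -> Optional[int]:
    """Buffer strength (in multiples of a unit buffer) for a given pulsewidth.

    Returns None when the pulse cannot be distributed reliably
    (T_pw < 6 FO4 delays). Values between the reported points are
    assumed to follow a step function.
    """
    if t_pw_fo4 < 6:
        return None  # wire parasitics prevent meeting the required slew
    if t_pw_fo4 < 7:
        return 72
    if t_pw_fo4 < 8:
        return 36
    return 18  # 8 FO4 delays and wider


narrow = required_buffer_strength(6)
wide = required_buffer_strength(12)
print(f"narrow pulse needs {narrow // wide} times the buffering of a wide pulse")
```

The ratio between the narrowest distributable pulse (72×) and the wide-pulse case (18×) is the factor of four cited in the text.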
Monte Carlo simulations confirmed that the proposed design is robust across random and systematic process variations. As shown in Fig. 11 , the SPICE simulations with the RC-extracted netlists of the longest path, the shortest path, and the clock tree confirm that the maximum of T pw is always less than the minimum of the shortest path delay across random process variations. In addition, the minimum of the shortest path delay is still larger than 1/4.5 of the maximum of the longest delay, confirming that the proposed technique can offer a good amount of time borrowing.
In the experiments with both random and systematic process variation effects, as shown in Fig. 12 , the ratio of the minimum of the shortest path delay to the maximum of the longest path delay is in the range of 0.22 to 0.35, again confirming that our proposed techniques can offer a good amount of time borrowing. This ratio indicates that, with this time borrowing ability, our proposed techniques allow a 22% to 35% reduction in clock period. Also, the ratio of the maximum of T pw to the minimum of the shortest path delay is in the range of 0.71 to 0.96, which shows that the maximum of T pw is always less than the minimum of the shortest path delay across five different corners, and thus there is no hold time violation. These simulations show the robustness of the proposed FIR design with the shortest path delay of 14 FO4 delays, which contains a margin of 1 ∼ 2 FO4 delays.
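The hold-safety criterion used in these simulations (maximum T pw below the minimum shortest-path delay) can be written as a small check. This is a sketch under a simplifying assumption: the latch hold time itself is not modeled separately, mirroring how the text states the condition. The function name is illustrative; the example inputs are the nominal 12 and 14 FO4 delay values from the design.

```python
def hold_check(t_pw_max: float, shortest_path_min: float):
    """Return (ratio, margin, is_safe) for the hold condition T_pw < shortest path.

    Both delays must use the same unit (e.g., FO4 delays).
    """
    ratio = t_pw_max / shortest_path_min
    margin = shortest_path_min - t_pw_max
    return ratio, margin, ratio < 1.0


ratio, margin, is_safe = hold_check(t_pw_max=12.0, shortest_path_min=14.0)
print(f"ratio = {ratio:.2f}, margin = {margin:.0f} FO4, hold safe = {is_safe}")
```

For the nominal values the ratio is about 0.86, inside the reported 0.71 to 0.96 range, and the margin is 2 FO4 delays.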
III. TEST-CHIP AND MEASUREMENT RESULTS
A. Test-Chip Organization
Test chips for the FIR cores have been fabricated in a 65-nm general-purpose CMOS process. Fig. 13 shows the die photograph of the test chip. Each chip contains both the proposed and the FF-based baseline designs. The total area of the chip is 0.12 × 0.6 mm 2 , the core area of proposed FIR is 0.0105 mm 2 , and the core area of baseline FIR is 0.0091 mm 2 .
B. Performance Measurements
We measure the functionality and the maximum F CLK across T pw values at room temperature. As shown in Fig. 14 , the FIR is functional at 5 to 12 FO4 delays of T pw . If T pw is too small, the FIR core fails because of unreliable pulse distribution. If T pw is too large, it also fails because of the hold time violation. The maximum F CLK improves approximately linearly with T pw thanks to the greater amount of time borrowing. Fig. 15 shows T C of the baseline and the proposed FIR and T pw at the performance-optimal point. The optimal T pw is found to be between 10 and 14 FO4 delays at V DD = 0.33 to 0.4 V, which is roughly 1/3 of T C . The T C reduction from the baseline FIR to the proposed FIR is approximately equal to the optimal T pw (1/3 T C ), which is consistent with (2). We also measure the delay and Energy/cycle across V DD values. As shown in Fig. 16 , the proposed FIR core achieves 20.2 MHz, 1.31 pJ/cycle, and 65-ns · pJ EDP at T pw = 12 FO4 delays and V DD = 0.35 V. The baseline FIR core achieves 13.9 MHz, 1.47 pJ/cycle, and 105-ns · pJ EDP. F CLK is 45.2% higher than that of the baseline, and Energy/cycle and EDP are 11% and 38% less than those of the baseline core.
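The headline improvements follow from the measured frequency and energy values. The sketch below recomputes them from the rounded numbers in the text, taking EDP as Energy/cycle times the clock period; because the inputs are rounded, the results agree with the reported figures only to within rounding. The function name is illustrative.

```python
def edp_ns_pj(f_clk_mhz: float, energy_pj: float) -> float:
    """Energy-delay product in ns*pJ: energy per cycle times the clock period."""
    period_ns = 1e3 / f_clk_mhz  # 1 / (f MHz) expressed in ns
    return energy_pj * period_ns


# Measured values at V_DD = 0.35 V and T_pw = 12 FO4 delays (from the text)
F_PROPOSED_MHZ, E_PROPOSED_PJ = 20.2, 1.31
F_BASELINE_MHZ, E_BASELINE_PJ = 13.9, 1.47

edp_proposed = edp_ns_pj(F_PROPOSED_MHZ, E_PROPOSED_PJ)
edp_baseline = edp_ns_pj(F_BASELINE_MHZ, E_BASELINE_PJ)

freq_gain = F_PROPOSED_MHZ / F_BASELINE_MHZ - 1.0
energy_saving = 1.0 - E_PROPOSED_PJ / E_BASELINE_PJ
edp_saving = 1.0 - edp_proposed / edp_baseline

print(f"EDP: proposed {edp_proposed:.1f}, baseline {edp_baseline:.1f} ns*pJ")
print(f"F_CLK +{freq_gain:.1%}, Energy/cycle -{energy_saving:.1%}, EDP -{edp_saving:.1%}")
```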
C. Measurement Across PVT Variations
As the pulse generator is on the same die with the FIR core, it can track the FIR core across PVT variations. This allows us to keep using the same T pw setting across PVT variations. As shown in Fig. 17 , across V DD = 0.33-0.4 V, we find that T pw that makes the proposed FIR core functional is from 6 to 12 FO4 delays. Fig. 18 shows that F CLK improvement at the optimal T pw (12 FO4 delays) is 24%-45% and energy savings are 7.3%-12%.
Across process variations in 11 chips, the functional T pw is measured to be 7 FO4 to 12 FO4 delays (Fig. 19) . We also measure the F CLK and Energy/cycle of the proposed cores configured with the single T pw setting and the baseline cores using FFs. As shown in Fig. 20 , F CLK improvement is 17.4%-56% and energy savings are 9.1%-24.2%. The mean (m) and standard deviation (σ ) of F CLK improvement are 30% and 13%, respectively. The mean and standard deviation of energy savings are 16% and 5%, respectively.
We also measured the functional T pw of a typical chip across temperature variations. As shown in Fig. 21 , the functional T pw across 0°C-80°C lies from 9 FO4 to 14 FO4 delays. When operating at 0°C-80°C, as shown in Fig. 22 , the proposed FIR core achieves 16.7% to 51.6% higher F CLK and 5.4% to 14% less Energy/cycle over the baseline. Table III summarizes the measurement results across PVT variations. Across PVT variations, the core can operate with the largest common T pw : 12 FO4 delays. The maximum F CLK improvement at this T pw is ∼50% and the maximum energy savings are 24.2%, as compared to the baseline design. 
D. Comparisons
The core area and power breakdown of the baseline and proposed FIR are shown in Fig. 23 . The core area of the baseline FIR is 9110 μm 2 , with a logic area of 6374 μm 2 posed FIR. The latches in the proposed FIR are 26% smaller than the FFs in the baseline. Combined together, the area overhead is 15%, as compared with the baseline. Table IV summarizes the comparisons of the proposed and the baseline FIR cores. At the design point (V DD = 0.35 V), the F CLK , Energy/op., and EDP improvements are 45.2%, 11%, and 38%, respectively.
As shown in Figs. 2 and 4, a larger T pw can enable higher performance, but it can also cause higher energy and area consumption. The F CLK improvement is limited as we increase T pw for two reasons. The first is that we cannot gain much beyond a certain point (T pw,limit ), since the last pipeline stage would not have enough time to compute as it gives too much time to the previous stages. The second is that when T pw is greater than half the cycle time, the total maximum time that can be borrowed is still equal to or less than half the cycle time. Hence, when T pw is greater than T pw,limit or half the cycle time, we gain no performance improvement and incur only an energy penalty. To elaborate on this, we performed simulations to characterize the power-delay-product (PDP) of our proposed pulsed-latch-based FIR core across different T pw values. We summarize the simulation results in Fig. 24 . It shows that the PDP is optimized at T pw =∼1/3 T C .
Table V shows the comparison of our FIR core with the state-of-the-art FIR cores [7] , [22] - [26] . The proposed design achieves among the best energy-figure of merit (FoM) (normalized to the process node) [26] , behind only the works presented in [25] and [26] . The works in [25] and [26] employ no variation tolerant techniques, and achieve about two to three orders of magnitude lower throughput than our proposed design. We also compare the FIR cores in the energy and throughput tradeoff (i.e., Energy-FoM/Throughput), which shows that our design also achieves among the best tradeoff, behind only the work presented in [7] . Note that [7] uses a low-power process technology and is designed for super-V t operation.
IV. CONCLUSION
This paper presents a methodology to design pulsed-latch pipelines in near-/sub-V t circuits that can use very wide pulses. Such wide pulses can severely increase area overhead because of excessive short path padding, and thus are not considered a common design choice in super-V t digital pipelines. In this paper, we propose a multi-V t -based padding technique to reduce the overhead, which becomes significantly more effective in near-/sub-V t pipelines. Experiments with the ISCAS benchmark circuits show that our technique can consistently reduce overhead by ∼two times. The measurement of the FIR prototypes based on the proposed technique demonstrates 45.2% F CLK , 11% Energy/cycle, and 38% EDP improvement over the baseline using edge-triggered FFs. The proposed core can also operate robustly at a single T pw setting (12 FO4 delays) across PVT variations. The area overhead is 15%.
