Abstract: Among prior short-path padding algorithms, the greedy heuristic based on the number of short paths has proven to be both area-saving and fast. For ultra-low-supply multi-voltage designs, however, the number of delay buffers explodes because padding locations are chosen without considering delay variations. To overcome this problem, we propose a new evaluation function, "effectiveness", for the greedy heuristic. It characterizes the benefit of buffer insertion, and thereby optimizes the allocation of padding delay and reduces buffer overhead. An improved algorithm that dynamically updates slacks during the delay assignment stage is also introduced. Experimental results show an average area reduction of 91.6% compared to the prior method.
Introduction
NTC (near-threshold) or STC (sub-threshold) designs have been shown to offer better energy efficiency than designs at standard supply voltage [1]. However, under these conditions circuits must tolerate severe PVT variations, so a large timing derate has to be added to the constraints to ensure circuit functionality. Meanwhile, clock skew becomes very large and hard to repair, especially for timing paths that cross voltage domains, because a clock buffer's delay varies significantly with supply voltage. Under such circumstances, ultra-low-supply designs suffer far more hold violations than usual, which leads to a great area cost for padding short paths. It therefore becomes critical to limit the total number of inserted buffers when padding short paths.
Short-path padding algorithms follow two basic directions. The first is an ILP (Integer Linear Programming) formulation, which solves equations constructed from timing constraints [2, 3, 4]. This method was first proposed in [2] and was preferred by later works because the mathematical approach yields optimal solutions. However, its compute cost grows exponentially with circuit size and becomes impractical for VLSI circuits [5]. In [4] the authors proposed a graph reduction method for the ILP formulation in order to reduce runtime, yet the experimental results did not show a radical improvement. The second direction is greedy heuristics, first described in [2, 6]. The method in [2] chose locations with the maximum setup margin to maintain the minimum clock period, but it was area-consuming and has been abandoned in today's nano-scale designs. The method in [6] chose locations with the maximum number of violated paths to maximize the benefit of each buffer insertion. It was also adopted in [5, 7, 8]. [7] added a preliminary procedure that sizes down buffers before applying the greedy method, while [5] added a preliminary procedure that retimes or relocates critical registers; [5] also presented an exhaustive comparison of the ILP and greedy approaches. [8] mainly focused on a new algorithm for predicting the minimum period of resilient circuits. None of [5, 7, 8] changed the original greedy heuristic proposed in [6]. Although greedy heuristics may not give an optimal solution, the area overhead relative to ILP is small enough to be acceptable, and the runtime is linear in circuit size [5]. So far, the greedy heuristic remains the best published algorithm in both area reduction and runtime.
The greedy heuristic shows one disadvantage in ultra-low-supply multi-voltage designs: it does not take into account buffer delay variations between different locations or voltage domains. In fact, a buffer's delay varies drastically across a circuit with ultra-low supply voltage, which causes inappropriate choices during padding allocation. Take the circuit in Fig. 1 as an example: the path from R0 to vd_l/R0 is setup critical, while the paths from R1 to both vd_l/R1 and vd_l/R2 violate hold. The prior heuristic method will first pick point A for padding, because two short paths pass through it (R1-A-B-vd_l/R1 and R1-A-C-D-vd_l/R2). Suppose a buffer provides 0.1 ns of delay in the high voltage domain (typical under a 1.1 V supply) and 1.0 ns of delay in the low voltage domain (typical under a 0.6 V supply). Padding on a point in the high voltage domain is then uneconomic: padding on A requires 30 buffers, while padding on B and C requires only 5 buffers in total. Moreover, delay variation among points in the same low voltage domain is also significant, so some of those points may become uneconomic as well. For instance, if a buffer provides 1.0 ns on C and 0.9 ns on D, picking D first will require more buffers than picking C. Based on our observation, prior padding methods, including commercial tools, introduce copious amounts of delay buffers when dealing with ultra-low-supply designs.
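The buffer counts in this example follow from dividing the delay to be padded by the per-buffer delay at each point. A minimal sketch of that arithmetic is given below. The hold violations assigned to A, B and C are hypothetical values chosen to reproduce the counts quoted above, and `buffers_needed` is an illustrative helper name.

```python
def buffers_needed(hold_violation_ns, buffer_delay_ns):
    """Smallest number of identical buffers whose total delay covers the violation."""
    # Work in integer picoseconds so the ceiling is not disturbed by float noise.
    violation_ps = round(hold_violation_ns * 1000)
    delay_ps = round(buffer_delay_ns * 1000)
    return -(-violation_ps // delay_ps)  # ceiling division


# Hypothetical worst hold violation (ns) that padding at each point must cover.
violation = {"A": 3.0, "B": 3.0, "C": 2.0}
# Per-buffer delay (ns): 0.1 ns in the high voltage domain, 1.0 ns in the low one.
buffer_delay = {"A": 0.1, "B": 1.0, "C": 1.0}

count = {p: buffers_needed(violation[p], buffer_delay[p]) for p in violation}
pad_on_a = count["A"]                     # pad once, upstream of both short paths
pad_on_b_and_c = count["B"] + count["C"]  # pad each short path in the low domain
print(pad_on_a, pad_on_b_and_c)
```

Under these assumed violations, one location in the high voltage domain costs six times as many buffers as two locations in the low voltage domain, which is the imbalance that a purely path-count-based priority cannot see.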
We propose a heuristic method to locate better sites for padding delay. To the best of our knowledge, this paper is the first to discuss short-path padding algorithms for NTC or STC designs. The innovations of our method are as follows. First, we propose a new evaluation function, "effectiveness", which takes buffer delay variation into account when choosing locations. With this function, points in high voltage domains, or in low voltage domains but providing less delay, receive lower priority, which helps to find better locations for buffer insertion. Second, we introduce a slack-based padding spreading algorithm and prove its accuracy with formulas. Compared with the method described in [5, 7] and a commercial tool, experimental results show that our method remarkably reduces area overhead for NTC and STC designs. The rest of this paper is organized as follows: Part 2 describes our modeling and the implementation of the algorithm; Part 3 presents experimental results; finally, Part 4 concludes the paper.
Fig. 2 gives a brief overview of our approach. The implementation starts with a post-route design on which static timing analysis (STA) has been run in Primetime. Unlike [8], in which the slack on each vector is calculated by tracing arrival time and required time from start points or endpoints, the necessary timing information is collected directly from a Primetime session in order to maintain accuracy. A programmable graph model, the $G(V, E)$ graph, is also constructed beforehand, with all possible buffer locations marked as available. With the acquired information, we calculate the new evaluation function for all available vectors. The vector with the highest priority is padded first and then marked as unavailable. The assigned padding delay is spread to the associated vectors, whose evaluation function is also updated. Delay assignment and spreading are iterated until no vectors remain available.
After converting the padded delay into a buffer insertion list, we perform ECO-PR and STA, then check whether the padding result meets the timing constraints.
Algorithm

G(V, E) graph modeling
The $G(V, E)$ graph, as described in [3, 4, 5, 6, 8], is the basis for implementing the padding algorithm. It is constructed from a set of vectors $V$ and a set of edges $E$ that connect the vectors. As illustrated in Fig. 1, all pins in the data path are represented as vectors (in [8], by contrast, vectors represent gates), and all timing arcs as edges. The solid edges represent timing arcs from outputs to inputs (wires), while the dotted ones represent timing arcs from inputs to outputs inside gates ($E = E_w \cup E_g$). In this graph, padding delay is determined on vectors and then spread out along the edges. Based on the $G(V, E)$ graph and basic concepts of timing analysis, we have the following definitions and equations. Eq. (1, 2) give the max and min slack of each vector, computed from the prior vector's arrival time and the following vector's required time. Eq. (3, 4) give the max and min slack of each edge. Eq. (5, 6) relate the slack of a vector to the slacks of its connected edges. Eq. (7) states that the number of violated paths through a vector is the sum of the violated paths through its connected edges. Eq. (8) states that the ths of a vector is the sum of the ths of its connected edges. These equations are the basis for our derivation and program:
$R$ stands for the setup required time of a vector, $r$ for the hold required time; $A$ stands for the max arrival time of a vector, $a$ for the min arrival time; $S$ stands for the setup slack of a vector or an edge, $s$ for the hold slack; $d$ stands for the delay of an edge; $N$ stands for the number of hold-violated paths through a vector or an edge; $ths$ stands for the total negative hold slack, that is, the sum of the slacks of all hold-violated paths through a vector or an edge; $\tau$ stands for the delay that will be introduced by a certain buffer on a vector. In what follows, $i \in \mathrm{fanin}(v)$ and $x \in \mathrm{fanout}(v)$.
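To make the quantities $N$, $ths$ and $s$ concrete, the following sketch enumerates the paths of a small hypothetical graph and accumulates them per vector and per edge. The graph, delays and required times are invented for illustration, and all times are integer picoseconds. Hold slack is taken as minimum arrival minus hold required time, which is the usual STA convention and an assumption here. The final check illustrates the statements behind Eq. (7, 8): the path count and ths of a vector equal the sums over its fanin edges.

```python
from collections import defaultdict

# Hypothetical graph: edge -> minimum delay (ps). "R1" is the start point;
# "E1" and "E2" are endpoints with hold required times (ps).
min_delay = {
    ("R1", "A"): 100, ("A", "B"): 100, ("A", "C"): 200,
    ("B", "E1"): 100, ("C", "E1"): 100, ("C", "E2"): 300,
}
hold_required = {"E1": 500, "E2": 400}

fanout = defaultdict(list)
for (u, w) in min_delay:
    fanout[u].append(w)


def paths_from(v):
    """Yield every path, as a list of vectors, from v to an endpoint."""
    if v in hold_required:
        yield [v]
        return
    for w in fanout[v]:
        for tail in paths_from(w):
            yield [v] + tail


N_v, ths_v, s_v = defaultdict(int), defaultdict(int), {}
N_e, ths_e = defaultdict(int), defaultdict(int)
for path in paths_from("R1"):
    edges = list(zip(path, path[1:]))
    slack = sum(min_delay[e] for e in edges) - hold_required[path[-1]]
    if slack >= 0:
        continue  # only hold-violated paths contribute to N and ths
    for v in path:
        N_v[v] += 1
        ths_v[v] += slack
        s_v[v] = min(s_v.get(v, 0), slack)  # worst hold slack through v
    for e in edges:
        N_e[e] += 1
        ths_e[e] += slack

# Sum over the fanin edges of E1, to compare with the vector-level values.
fanin_edges = [e for e in min_delay if e[1] == "E1"]
n_sum = sum(N_e[e] for e in fanin_edges)
ths_sum = sum(ths_e[e] for e in fanin_edges)
print(N_v["A"], ths_v["A"], s_v["A"], n_sum, ths_sum)
```

Explicit path enumeration is only practical on toy graphs. It is used here because it makes the definitions directly visible, not because it scales.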
Evaluation function
As mentioned above, the prior methods suffer from over-padding during the timing closure stage. The greedy heuristic based on the number of short paths does not account for buffer delay variation, so the locations it chooses become area-consuming wherever the buffer delay is small. We need a new evaluation function that characterizes the benefit of buffer insertion in terms of THS reduction. Lemma: once a buffer with delay $\tau(v)$ is inserted on a vector $v$, the resulting THS reduction $\Delta ths(v)$ satisfies:
$$\Delta ths(v) \le \min\{|ths(v)|,\; N(v) \cdot \tau(v)\}, \quad |s(v)| \ge \tau(v) \tag{9}$$
$$|ths(v)| \le \Delta ths(v), \quad |s(v)| \le \tau(v) \tag{10}$$
For example, consider vector $v_1$ whose slack list is $\{-0.2, -0.2, -0.06, -0.04\}$. In real circuits, thousands of short paths can pass through a single vector, and it is not realistic to calculate the actual $\Delta ths(v)$ for such vectors. We propose a new evaluation function, "effectiveness", that can be calculated easily, satisfies Eq. (9, 10), and also takes buffer delay into consideration:
With the proposed evaluation function, the padding priority becomes $B = C > D > A$. The priority of A is thus downgraded, so A is no longer padded first.
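The bounds in Eq. (9, 10) can be checked numerically on the slack list of $v_1$. The sketch below assumes that a buffer recovers, on each violated path, the smaller of the buffer delay and that path's violation. This per-path model, the function name `ths_reduction`, and the three buffer delays tried are assumptions made for illustration. The slacks are expressed in integer picoseconds on the assumption that the listed values are in nanoseconds.

```python
def ths_reduction(slacks, tau):
    """THS removed by one buffer of delay tau, under the per-path model above."""
    # A violated path gains tau of delay but cannot recover more than it violates.
    return sum(min(-s, tau) for s in slacks if s < 0)


slacks_ps = [-200, -200, -60, -40]            # slack list of v1
n_paths = sum(1 for s in slacks_ps if s < 0)  # N(v1)
ths = sum(s for s in slacks_ps if s < 0)      # ths(v1)
worst = min(slacks_ps)                        # s(v1)

# Map each hypothetical buffer delay to (reduction, upper bound from Eq. (9)).
results = {}
for tau in (50, 100, 250):
    bound = min(abs(ths), n_paths * tau)
    results[tau] = (ths_reduction(slacks_ps, tau), bound)
    print(tau, results[tau])
```

With the two smaller delays the reduction stays below the bound because the paths with small violations saturate early. Once the buffer delay exceeds $|s(v_1)|$, every path is fixed and the reduction equals $|ths(v_1)|$.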
Padding and spreading
Once we have a vector with the highest priority, padding delay will be assigned:
With this assignment, the slacks of other vectors on common paths will change. Yet it would be time-consuming to realize the padding delay and re-update timing information whenever a new assignment occurs. It is more efficient to assign as much delay as possible on suitable vectors and then realize all of it together. We therefore propose a padding delay spreading algorithm. Unlike the spreading algorithm in [8], which needs to recalculate arrival time and required time, our approach starts from the padded vector and considers only the slacks of each vector along the way, which reduces the data access in this stage. We use $\{i, j, \ldots\}$ to denote the fanins of $v$, $\{x, y, \ldots\}$ to denote the fanouts of $v$, and $\{u, v, \ldots\}$ to denote the fanins of $x$. Assume $P(i)$ occurred on vector $i$; the updated setup slack of $v$ can be calculated as:
As for vector $x$, the setup slack of $x$ can be calculated as:
It can be similarly derived that, for hold slack calculation, $vp(v)$ is enough for later vectors. Furthermore, it can be derived that this pattern of spreading padding delay also applies from later vectors to prior ones. This conclusion is exemplified in Fig. 3, which shows a small $G(V, E)$ graph with the initial delay of each edge and the initial slacks of each vector. Both $i$ and $j$ are on hold-violated paths through $v$ and have setup margin. In situation 1, assume $i$ has priority over the other vectors and $P(i) = 3$ is assigned to $i$; in situation 2, assume $j$ has priority over the other vectors and $P(j) = 12$ is assigned. Vector $v$ is visited first, and its new slacks and virtual padding delays are calculated. The slacks of $x$ are then updated based on $v$ only, without any need to consider $i$ or $j$. The detailed spreading procedure is exhibited in Fig. 3. Subsequent vectors are visited and updated in the same way.
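The idea that a later vector needs only the virtual padding delay of its immediate fanin can be sketched as follows for the setup (max-delay) case. The update rule used here, subtracting each fanin's virtual padding from the corresponding edge slack and then taking the minimum, is an assumption consistent with the description rather than a transcription of the formulas referenced above. The graph, delays and padding values are hypothetical and are not those of Fig. 3. The sketch compares the slack-only update against a full recomputation of arrival times.

```python
def arrivals(source_arrival, delay, fanin, order):
    """Max arrival time at every vector, recomputed from the start points."""
    arr = dict(source_arrival)
    for b in order:  # `order` lists the non-source vectors topologically
        arr[b] = max(arr[a] + delay[(a, b)] for a in fanin[b])
    return arr


def spread_forward(edge_slack, fanin, order, start, padding):
    """Update setup slacks after padding `start`, using slacks only."""
    edge_slack = dict(edge_slack)  # leave the caller's data untouched
    vp = {start: padding}          # virtual padding delay seen at each vector
    new_slack = {}
    for b in order:
        old = min(edge_slack[(a, b)] for a in fanin[b])
        for a in fanin[b]:
            edge_slack[(a, b)] -= vp.get(a, 0)  # only padded fanins matter
        new_slack[b] = min(edge_slack[(a, b)] for a in fanin[b])
        vp[b] = old - new_slack[b]  # what later vectors actually observe
    return new_slack, vp


# Hypothetical graph: i, j -> v and v, u -> x (max delays, arbitrary units).
delay = {("i", "v"): 3, ("j", "v"): 1, ("v", "x"): 2, ("u", "x"): 1}
fanin = {"v": ["i", "j"], "x": ["v", "u"]}
order = ["v", "x"]
source_arrival = {"i": 2, "j": 6, "u": 9}
required = {"v": 18, "x": 20}  # setup required times

base = arrivals(source_arrival, delay, fanin, order)
edge_slack = {(a, b): required[b] - base[a] - delay[(a, b)] for (a, b) in delay}


def compare(padding):
    """Return (slack-only result, virtual padding, full-recomputation result)."""
    new_slack, vp = spread_forward(edge_slack, fanin, order, "i", padding)
    padded = dict(source_arrival)
    padded["i"] += padding
    ref = arrivals(padded, delay, fanin, order)
    return new_slack, vp, {b: required[b] - ref[b] for b in order}


small = compare(3)
large = compare(6)
print(small)
print(large)
```

In the first case part of the padding on `i` is absorbed at `v` because `j` was the later-arriving fanin, and nothing reaches `x`. In the second case the padding dominates and a reduced virtual delay propagates to `x`. In both cases the slack-only result matches the full recomputation on this example.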
While padding delay is spread along the edges, the "effectiveness" function also needs to be updated for the visited vectors. Padding on one vector resolves part of the ths of its related vectors, yet does not necessarily change their min slack, so their "effectiveness" may be lowered. That is to say, the priority of vectors should be adjusted dynamically. However, the actual decrease in ths
Because of the delay padded on $i$, $ths(v)$ decreases by a certain amount (Eq. (22)). Yet $s(v)$ may not change, so it can be inferred that padding on $i$ will lower $N_e(v)$. This consequence is consistent with the prior discussion. After $N'_e(v)$ is obtained, $E'(v)$ is also recalculated and the priorities are reordered. For example, in Fig. 1,
In practice, the detailed program flow is listed in Table I. For each selected vector with the highest $E$, the padding process (Padding-Spreading) first pads the vector, then recursively calls the spreading processes (Spreading-Forward and Spreading-Backward) until all fanin or fanout vectors have been visited and updated. The process then iteratively searches for another available vector with the highest $E$ until no available vectors remain.
Realizing delay to buffers
Realizing delay on a vector is a mature problem that has been widely discussed. The majority of the delay can be realized by inserting delay buffers, and the remainder by sizing cells, adding spare cells as loads, or binding dummy metal. The problem of deciding which buffers to insert is transformed into an ILP problem. Let $\{\tau_1(v), \tau_2(v), \ldots, \tau_n(v)\}$ denote the delays of the candidate buffers, $\{a_1(v), a_2(v), \ldots, a_n(v)\}$ denote the areas of the candidate buffers, and $A(v)$ denote the total introduced area. Define:
$$P(v) = \sum_{k=1}^{n} N_k(v)\,\tau_k(v), \qquad A(v) = \sum_{k=1}^{n} N_k(v)\,a_k(v)$$
The goal is to obtain a set of numbers $\{N_1(v), N_2(v), \ldots, N_n(v)\}$ that tries to satisfy $P(v)$ and gives a minimum $A(v)$. For each of the assigned vectors, the ILP procedure is run in turn and solved quickly, because there are generally no more than ten candidate buffers ($n \le 10$ in Eq. (25)).
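Because the number of candidate buffers is small, the selection for a single vector can also be illustrated without an ILP solver. The sketch below swaps in a simple dynamic program over the remaining delay, which is not the ILP procedure described above. It assumes that "satisfying" $P(v)$ means that the summed buffer delay is at least $P(v)$, that delays are integer picoseconds, and that areas are integers in arbitrary units. The candidate list and the name `choose_buffers` are hypothetical.

```python
def choose_buffers(target_ps, candidates):
    """Minimum-area buffer counts whose total delay is at least target_ps.

    candidates: list of (delay_ps, area) pairs with positive integer delays.
    Returns (total_area, counts), where counts[k] is the number of type-k buffers.
    """
    n = len(candidates)
    # best[t] holds the cheapest (area, counts) that covers a delay of t.
    best = [(0, (0,) * n)] + [None] * target_ps
    for t in range(1, target_ps + 1):
        for k, (delay, area) in enumerate(candidates):
            prev_area, prev_counts = best[max(0, t - delay)]
            total = prev_area + area
            if best[t] is None or total < best[t][0]:
                counts = list(prev_counts)
                counts[k] += 1
                best[t] = (total, tuple(counts))
    return best[target_ps]


# Hypothetical candidate buffers at one vector: (delay in ps, area units).
candidates = [(100, 10), (250, 20), (600, 45)]
area, counts = choose_buffers(700, candidates)
realized = sum(c * d for c, (d, _) in zip(counts, candidates))
print(area, counts, realized)
```

The table has one entry per picosecond of target delay, so this approach is only reasonable when the padding delay per vector is modest. It serves here to show the trade-off between a few large buffers and many small ones.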
Time complexity
In this part, we use the following notations: $N$, the number of vectors with negative hold slack in the $G(V, E)$ graph; $P$, the number of vectors with assigned padding delay; $F$, the total number of fanin and fanout vectors visited from padded vectors during Spreading-Forward and Spreading-Backward. In the stage of $G(V, E)$ construction, information gathering, and $E(v)$ calculation, every vector in the graph needs to be visited once, so the time complexity is $O(N)$. For padding and spreading, each vector with a padded delay triggers a spreading procedure in its fanout and fanin cones, in which approximately every vector is visited once, so the time complexity is $O(F)$. In the delay realization stage, if the ILP is assumed to be solved in $O(1)$, the time complexity is $O(P)$ since there will be $P$ times to
Experiment results
Our experiment is implemented at the SMIC 55 nm technology node. The first reference method is the greedy heuristic in [5, 7, 8] (abbreviated as "ref" below). As concluded in [9], "EDA tool innovation in the timing closure space has been impressive". In contrast to the slow progress of published algorithms, commercial tools can now perform complicated physically aware ECO timing optimization with multi-scenario support, scaling to hundreds of threads to shorten runtime [11]. Under standard voltages, Primetime can also achieve minimum buffer insertion within a short runtime. We therefore choose Primetime as an additional reference method ("pt"). The third method is our proposed greedy heuristic, which selects padding locations by the "effectiveness" function ("ours"). Both "ref" and "ours" are programmed using Tcl/Tk and built-in commands in Primetime.
The targeted circuits include several large-scale circuits from the benchmark suite [12] and our industrial test chip, which has an embedded microprocessor under ultra-low voltage and the remaining peripherals under standard voltage. Circuit b12 is not divided into multiple voltage domains, in order to verify that our method also makes a difference within a single low voltage domain. The other circuits are divided into two domains: the low voltage domain is under ff_0p33v_85c for the STC best condition, and the other domain is under a 1p32v supply. In addition, our chip was implemented both under ff_0p66v_85c for the NTC best condition ("ind06") and under ff_0p33v_85c ("ind03"). Based on our Monte Carlo simulation, we set a 100% derate for NTC corners and a 300% derate for STC corners. All circuits are post-routed in IC-Compiler [10] and then analyzed in Primetime with parasitics extracted in StarRC. All three methods are first run with only one iteration. The buffers are then inserted back in IC-Compiler and the design is reanalyzed in Primetime. At this point THS may still not be fully cleared, because buffer delay may not be sufficient (if the smallest buffer delay is 0.2, 0.05 slack remains) and because there are deviations between the calculated buffer delay and the realized delay. A rounding-off stage with more buffers is therefore implemented, but we do not include this part in the comparison because it is not related to the methods. Finally, we record the results listed in Table II. Setup constraints are enforced in all cases to maintain the original period, and the results show that the minimum setup slack did not degrade afterward. Under NTC and STC corners, the area inserted by the "ref" method increased sharply. More concretely, too many padding locations were chosen in the high voltage domain. Take case ind06 as an example, illustrated in Fig. 4 (the inserted buffers are highlighted in white): many of these locations required thousands of buffers to make up for the hold slack. This led to an explosion in the total cell count.
As for "pt", under the STC corner it still inserted hundreds of unnecessary buffers on average in the high voltage domain. Moreover, although most of its selections did fall in the low voltage domain, the tool was unable to make the best choice among them. Our proposed method chose the locations with the best benefit in both area reduction and hold fixing, and thus avoided inserting buffers in the high voltage domain. Case b12 also shows that our method achieves lower area overhead within a single voltage domain. Overall, the area of inserted buffers was reduced by 91.6% on average compared with "ref", and by 39.2% on average compared with "pt". Fig. 5 shows the relationship between runtime and $F$. As $P \ll N \ll F$ (for example, in case ind03, $N = 148784$, $P = 544$, $F = 1273$ k), the runtime is dominated by the delay assignment stage and is linearly dependent on $F$. The results show that the runtime of realizing delay to buffers is negligible.
Conclusion
In this paper, we proposed an effectiveness-oriented greedy heuristic to overcome the drawback of prior short-path padding methods, namely that they cause a considerable buffer overhead when dealing with NTC or STC designs. Compared to prior proposals, our algorithm has been shown to be more effective under ultra-low-voltage conditions, with 91.6% less area overhead on average. There are more timing closure topics worth exploring under ultra-low supply voltage, such as the clock skew issue, minimum clock period exploration, the short-path padding problem for resilient circuits, and multi-scenario consideration. These topics all behave differently than under standard voltages, and thus need more targeted algorithms.
