Clock trees must be constructed to function even under the influence of on-chip variations (OCV). Bounding the latency of a clock tree, i.e., the maximum delay from the tree root to any sequential element, is important because the latency correlates with the maximum magnitude of the skews caused by OCV. In this paper, a latency constraint graph (LCG) that captures the latencies of a set of subtrees and the skew constraints between the subtrees is introduced. The minimum latency of a clock tree that can be constructed from the corresponding subtrees is equal to the (negative of the) length of a shortest path in the LCG, which can be computed in O(V E). Based on the LCG, we propose a framework that consists of a latency-aware clock tree synthesis (CTS) phase and a clock tree optimization (CTO) phase to construct latency-bounded clock trees. When applied to a set of synthesized circuits, the framework is capable of constructing latency-bounded clock trees that have higher yield compared to clock trees constructed in previous studies.
INTRODUCTION
With increasing impacts of on-chip variations (OCV), it is crucial to consider both timing constraints and latency when constructing clock trees for sequential circuits. Clock skew is the difference in the arrival time of the clock signal between a pair of sequential elements (or clock sinks). A clock tree must be constructed such that the clock signal is delivered meeting skew constraints even when the clock tree is under the influence of OCV. Earlier studies have focused on satisfying the skew constraints by providing guardbands to OCV by inserting safety margins in the skew constraints [15, 8]. However, by inserting safety margins in the skew constraints, the construction process becomes more constrained and the maximum delay from the root of the clock tree to any clock sink (maximum latency or simply latency in this paper) may become longer. This makes the constructed clock tree more susceptible to OCV. Typically, the propagation delay of a path in the clock tree correlates with the delay variations along the path. Therefore, by bounding the latency, the maximum magnitude of the skews caused by OCV is also bounded. Consequently, it is important to consider both safety margins and latency bounds when constructing clock trees.

* This research was supported by NSF awards CFF-1065318 and CFF-1527562.

ISPD '16, April 3-6, 2016, Santa Rosa, California, USA.
In [15] , a clock tree meeting arbitrary skew constraints (with safety margins inserted) was constructed by iteratively joining two smaller subtrees to form a larger subtree. The skew constraints were satisfied by joining each pair of subtrees within a feasible skew range (FSR). In [8] , the Greedy-UST/DME algorithm in [15] was extended to enable larger safety margins to be inserted in the skew constraints. However, latency optimization was not considered in [15, 8] . Latency optimization has been considered indirectly during the construction of a clock tree in [3, 5, 11] and directly during clock tree optimization (CTO) in [12, 13] .
The limitation of [12, 13] is that the potential latency reductions may be limited because of the structure of the initial clock tree. The drawback of [15, 8, 3, 5, 11] is that latency is not considered explicitly in the tree construction process. It is important to realize that the minimum latency of a clock tree, constructed from a set of subtrees, is dependent on both the latencies of the subtrees and the skew constraints between the subtrees, as illustrated in Figure 1 .
Ideally, clock trees with an optimal latency and safety margin should be constructed. In this paper, we propose a framework to synthesize clock trees meeting a user-specified latency bound Luser while providing a user-specified safety margin Muser in the skew constraints. A main contribution is the introduction of a latency constraint graph (LCG). An LCG captures the latencies to the sequential elements, the skew constraints, and the skews committed in a tree construction process. The minimum (maximum) latency of a clock tree that is constructed from the sinks captured in the LCG is equal to the negative value of the length of the shortest path from a virtual source to a virtual sink in an LCG, which we refer to as the latency path. Both the latency path and the length of the latency path can be computed using the Bellman-Ford algorithm in O(V E) [6] .
The proposed framework consists of a latency-aware clock tree synthesis (CTS) phase and a CTO phase. In the latency-aware CTS phase, the construction of a clock tree is viewed as performing a series of delay insertions and skew commitments. By performing delay insertions and skew commitments within feasible delay insertion ranges (FDIRs) and feasible latency ranges (FLRs), respectively, it is ensured that the length of the latency path is bounded, which in turn bounds the latency of the clock tree. After the CTS phase is complete, CTO is performed to remove any remaining timing violations by realizing delay adjustments.
To the best of our knowledge, this is the first work that performs latency minimization while providing safety margins in the skew constraints. Moreover, by committing skews within the intersections of FSRs [15] and FLRs, both arbitrary skew constraints and the latency bound are satisfied. Therefore, compared to [15, 8], our framework is also capable of constructing clock trees with given latency bounds. Furthermore, we observe that CTO is more effective when applied to clock trees with shorter latencies. Hence, we say that our latency-aware CTS phase constructs clock trees that are amenable to CTO. On a set of synthesized circuits with up to 13,216 sequential elements, we improve the yield to 100%.
PRELIMINARIES

Skew constraints
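The clock signal delivered to the sequential elements must meet both setup and hold time constraints. The setup and hold time constraints between a launching flip flop FFi and a capturing flip flop FFj are formulated as follows:

$$t_i + t_i^{CQ} + t_{ij}^{max} + t_j^{S} \le t_j + T \qquad (1)$$
$$t_i + t_i^{CQ} + t_{ij}^{min} \ge t_j + t_j^{H} \qquad (2)$$

where $t_i$ and $t_j$ are the arrival times of the clock signal at FFi and FFj; $t_i^{CQ}$ is the clock to output delay of FFi; $T$ is the clock period; $t_{ij}^{max}$ and $t_{ij}^{min}$ are the maximum and minimum propagation delays through the combinational logic between FFi and FFj; and $t_j^{S}$ and $t_j^{H}$ are the setup time and hold time of FFj. With $skew_{ij} = t_i - t_j$, $l_{ij} = t_j^{H} - t_i^{CQ} - t_{ij}^{min}$, and $u_{ij} = T - t_i^{CQ} - t_{ij}^{max} - t_j^{S}$, Eq (1) and Eq (2) are reformulated as follows, and referred to as skew constraints:

$$l_{ij} \le skew_{ij} \le u_{ij} \qquad (3)$$

To satisfy the skew constraints under variations, the constraints are tightened with a uniform safety margin $M_{user}$ as follows:

$$l_{ij} + M_{user} \le skew_{ij} \le u_{ij} - M_{user} \qquad (4)$$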
The skew constraints are tightened with a uniform safety margin because it is hard to estimate the impact of OCV before an initial clock tree has been constructed in CTS. The skew constraints with safety margins in Eq (4) can be captured in a skew constraint graph (SCG). In an SCG G = (V, E), the vertices V represent the sequential elements and the edges E represent the skew constraints. The skew constraints with safety margins in Eq (4) are represented with an edge eij from vertex i to vertex j, with the weight wij = −(lij + Muser) and with an edge eji from vertex j to vertex i with a weight of wji = uij − Muser. The SCG of the skew constraints in Figure 1 
On-chip variations
After CTS has been performed, the negative effects of OCV can be estimated using the timing and topology of the initial clock tree. Next, the estimates are used to evaluate performance and to guide CTO. The constraints in Eq (1) and Eq (2) are extended to include OCV as follows:

$$t_i + \delta_i + t_i^{CQ} + t_{ij}^{max} + t_j^{S} \le t_j - \delta_j + T \qquad (5)$$
$$t_i - \delta_i + t_i^{CQ} + t_{ij}^{min} \ge t_j + \delta_j + t_j^{H} \qquad (6)$$

where $\delta_i$ and $\delta_j$ are the delay variations caused by OCV. Let the closest common ancestor (CCA) of FFi and FFj be denoted as CCA(i, j). The delay variations $\delta_i$ and $\delta_j$ are only accumulated on the paths from CCA(i, j) to FFi and FFj, respectively. By assuming that the delay variations on a path are proportional to the propagation delay of the path, the delay variations $\delta_i$ and $\delta_j$ are estimated as follows:

$$\delta_i = c_{ocv} \cdot t_{CCA(i,j),i} \qquad (7)$$
$$\delta_j = c_{ocv} \cdot t_{CCA(i,j),j} \qquad (8)$$

where $t_{CCA(i,j),i}$ and $t_{CCA(i,j),j}$ are the propagation delays from CCA(i, j) to FFi and FFj, respectively; $c_{ocv}$ is a user-specified parameter. Consequently, a clock tree with a long latency is more susceptible to OCV than a clock tree with a short latency, as illustrated in Figure 2.
Using the estimates, the total negative slack (TNS) and the worst negative slack (WNS) in the constraints in Eq (5) and Eq (6) can be computed. Note that if the latency of the clock tree $L$ is less than or equal to a latency bound $M_{user} / (2 \cdot c_{ocv})$, it is ensured that $\delta_i + \delta_j \le M_{user}$ is satisfied for each pair of flip flops FFi and FFj, i.e., TNS = 0 and WNS = 0. This holds because the delay variations $\delta_i + \delta_j$ can be estimated with $c_{ocv} \cdot t_{CCA(i,j),i} + c_{ocv} \cdot t_{CCA(i,j),j}$ and bounded by $2 \cdot c_{ocv} \cdot L$ because $t_{CCA(i,j),i} \le L$ and $t_{CCA(i,j),j} \le L$. However, it is typically impossible, or very costly, to construct a clock tree meeting such stringent latency bounds. Therefore, we set the user-specified latency bound $L_{user}$ to be a fraction of the latency of a clock tree constructed with no latency bound.
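As a rough worked illustration of how stringent this bound is: since $\delta_i + \delta_j$ is bounded by $2 \cdot c_{ocv} \cdot L$, requiring $2 \cdot c_{ocv} \cdot L \le M_{user}$ gives $L \le M_{user} / (2 \cdot c_{ocv})$. The numbers below assume $c_{ocv} = 0.085$ (the value used in the experimental evaluation) and safety margins of 20 ps and 40 ps (values that appear in the evaluation); the pairing is chosen here only for illustration.

```latex
L \le \frac{M_{user}}{2 \cdot c_{ocv}}
  = \frac{20\,\text{ps}}{2 \cdot 0.085} \approx 117.6\,\text{ps},
\qquad
\frac{40\,\text{ps}}{2 \cdot 0.085} \approx 235.3\,\text{ps}.
```

Both values are well below the latencies of several hundred picoseconds reported for the constructed clock trees in the evaluation, which is consistent with the observation that such bounds are impractical to meet directly.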
Problem definition
This paper considers a clock tree synthesis problem, i.e., constructing clock trees that deliver synchronizing clock signals to sequential elements while satisfying skew constraints and transition time constraints under the influence of OCV, in order to obtain high yield. The objective is to minimize the power consumption of these clock trees.
Two key factors that influence yield are latency and safety margins. Large safety margins imply tighter skew constraints, which restrict the tree construction process and typically result in clock trees that have longer latencies; such clock trees are more susceptible to OCV. Therefore, we approach the synthesis problem by constructing latency-bounded clock trees with a uniform safety margin provided in the skew constraints. This problem has two inter-related components: (1) Determining the optimal values for the latency bound and the uniform safety margin, which are denoted as Lopt and Mopt, respectively. (2) Constructing a clock tree for some Lopt and Mopt.
We limit the scope of this paper to component (2) of this problem, i.e., constructing a clock tree for a given user-specified latency bound Luser and uniform safety margin Muser. We rely on the user to provide appropriate values of Luser and Muser. In our future work, we will also investigate component (1) of the problem, i.e., how to determine Mopt and Lopt, and the problem as a whole.
The Greedy-UST/DME algorithm
Using the Greedy-UST/DME algorithm [15], a clock tree can be constructed meeting the constraints captured in an SCG, i.e., the skew constraints with safety margins in Eq (4). The algorithm constructs a clock tree meeting arbitrary skew constraints by iteratively merging a pair of smaller subtrees to form a larger subtree. In [15], it was shown that if two subtrees were merged within a feasible skew range (FSR), all the skew constraints would be satisfied. An FSR is computed by finding two shortest paths in an SCG. Specifically, the FSR between two subtrees i and j is:

$$FSR_{ij} = [-d_{ij},\; d_{ji}] \qquad (9)$$

where $d_{ij}$ and $d_{ji}$ are the lengths of the shortest path from vertex i to vertex j and the shortest path from vertex j to vertex i in the SCG, respectively. Later, when the merging location is decided and the two subtrees are joined physically, the FSR is narrowed to a specific skew value $skew_{ij} = a$, committing the skew. The skew commitment requires two edges to be added to the SCG: an edge $e_{ij}$ with a weight of $w_{ij} = -a$ and an edge $e_{ji}$ with a weight of $w_{ji} = a$.
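The FSR computation and skew commitment described above can be sketched as follows. This is a minimal illustration, not the implementation of [15]: the function names, the three-sink instance, and its constraint values are hypothetical, and an edge (tail, head, w) is taken to encode t_head - t_tail <= w, which matches the SCG edge weights given earlier.

```python
import math

def shortest_paths(n, edges, src):
    """Bellman-Ford over (tail, head, weight) edges; returns path lengths."""
    dist = [math.inf] * n
    dist[src] = 0
    for _ in range(n - 1):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("negative cycle: skew constraints are infeasible")
    return dist

def scg_edges(constraints, margin):
    """constraints: (i, j, l_ij, u_ij), meaning l_ij <= t_i - t_j <= u_ij."""
    edges = []
    for i, j, lower, upper in constraints:
        edges.append((i, j, -(lower + margin)))  # edge e_ij
        edges.append((j, i, upper - margin))     # edge e_ji
    return edges

def fsr(n, edges, i, j):
    """Feasible skew range [-d_ij, d_ji] for skew_ij = t_i - t_j."""
    d_ij = shortest_paths(n, edges, i)[j]
    d_ji = shortest_paths(n, edges, j)[i]
    return (-d_ij, d_ji)

def commit_skew(edges, i, j, a):
    """Commit skew_ij = a by adding e_ij = -a and e_ji = a."""
    return edges + [(i, j, -a), (j, i, a)]

# Hypothetical instance: three sinks, two skew constraints, 5 ps margin.
edges = scg_edges([(0, 1, 5, 35), (1, 2, 10, 45)], margin=5)
print(fsr(3, edges, 0, 1))            # (10, 30)
committed = commit_skew(edges, 0, 1, 20)
print(fsr(3, committed, 0, 1))        # (20, 20)
print(fsr(3, committed, 0, 2))        # (35, 60)
```

Committing a skew inside the FSR collapses that range to a single value and, as the last line shows, may also narrow the ranges of other pairs.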
LATENCY CONSTRAINT GRAPH
We introduce the concept of a latency constraint graph (LCG) as follows. First, we define the term sink latency. For each respective sink, the sink latency is equal to the delay from the root of the subtree where the sink resides to the respective sink. An LCG G = (V, E) is an extension of an SCG. An LCG consists of the vertices and edges of an SCG with the addition of a virtual source and a virtual sink. An LCG for the subtrees in Figure 1(a) is shown in Figure 3. We refer to the original vertices of an SCG as sink vertices. As illustrated in Figure 3, the virtual source is connected to each of the sink vertices with a directed edge with a weight of 0. Moreover, each sink vertex is connected to the virtual sink with a directed edge with a weight equal to the negative value of the respective sink latency.
Latency path: We call the shortest path from the virtual source to the virtual sink the latency path, and the length of the latency path is denoted $s^p$. The minimum (maximum) latency of a clock tree that is constructed from the sinks captured in the LCG is equal to $-s^p$. Both the latency path and the length of the latency path can be found in O(V E) using the Bellman-Ford algorithm [6]. Consequently, the subtrees in Figure 1 can be joined to a clock tree with a latency of 65 because $-s^p = 65$ in Figure 3. To find a clock tree with the latency of $-s^p$, we introduce the reversed graph $\bar{G}$ of the LCG G. In the reversed graph $\bar{G}$, all the edges in G are reversed, i.e., an edge from vertex i to vertex j in G is equivalent to an edge from vertex j to vertex i in $\bar{G}$. Moreover, the length of the top-down shortest path from the virtual source to a vertex i in G is denoted $s^t_i$ and the length of the bottom-up shortest path from the virtual sink to a vertex i in $\bar{G}$ is denoted $s^b_i$. Interestingly, a clock tree constructed such that the latency to each sink is equal to the negative of the bottom-up shortest path length to the respective sink vertex satisfies both the latency requirement and the skew constraints. Consequently, in the example, the latency to the sequential elements in such a clock tree would be $[-s^b_1, -s^b_2, -s^b_3] = [65, 55, 40]$.
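A minimal sketch of the latency path computation is given below. The subtree latencies [5, 20, 40] are those of the running example, but the skew constraint edges are hypothetical values chosen here so that the result matches the latency of 65 quoted for the example; the actual constraints of Figure 1 are not reproduced. Function names are illustrative only.

```python
import math

def bellman_ford(num_vertices, edges, source):
    """Shortest-path lengths from source over (tail, head, weight) edges."""
    dist = [math.inf] * num_vertices
    dist[source] = 0
    for _ in range(num_vertices - 1):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("negative cycle: constraints cannot all be met")
    return dist

def build_lcg(sink_latencies, scg_edges):
    """Extend an SCG with a virtual source and a virtual sink."""
    n = len(sink_latencies)
    vsource, vsink = n, n + 1
    edges = list(scg_edges)
    for i, latency in enumerate(sink_latencies):
        edges.append((vsource, i, 0))       # virtual source -> sink vertex
        edges.append((i, vsink, -latency))  # sink vertex -> virtual sink
    return edges, vsource, vsink

def latency_assignment(sink_latencies, scg_edges):
    """Return (-s_p, [-s_b_i for each sink vertex i])."""
    edges, vsource, vsink = build_lcg(sink_latencies, scg_edges)
    n = len(sink_latencies) + 2
    s_p = bellman_ford(n, edges, vsource)[vsink]
    reversed_edges = [(v, u, w) for u, v, w in edges]  # reversed graph
    s_b = bellman_ford(n, reversed_edges, vsink)
    return -s_p, [-s_b[i] for i in range(len(sink_latencies))]

# Hypothetical SCG: 10 <= t_0 - t_1 <= 30 and 15 <= t_1 - t_2 <= 40.
scg = [(0, 1, -10), (1, 0, 30), (1, 2, -15), (2, 1, 40)]
min_latency, final_latencies = latency_assignment([5, 20, 40], scg)
print(min_latency, final_latencies)  # 65 [65, 55, 40]
```

Subtracting the current sink latencies from the returned assignment gives the delay that would have to be inserted above each subtree.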
Updating edge weights in LCG
The construction of a clock tree can be viewed as a series of delay insertion and skew commitment operations that correspond to both physical commitments in a tree construction process and equivalent modifications of edge weights in the LCG and its reversed graph. The two operations are outlined as follows:
Delay insertion: This is the process of inserting a piece of wire or a buffer at the root of a subtree. In the LCG, the delay insertion increases the latency to all the sinks residing in the subtree, which necessitates a reduction of the corresponding latency edge weights. (Without loss of generality, every subtree can be captured using a single representative sink. Therefore, a delay insertion is equivalent to the reduction of a single edge weight.)
Skew commitment: This is the process of joining two subtrees i and j into a single subtree. As in an SCG, if a skew $skew_{ij} = a$ is committed, two edges are required to be added to the LCG: an edge $e_{ij}$ with a weight of $w_{ij} = -a$ and an edge $e_{ji}$ with a weight of $w_{ji} = a$. Again, the added edges always correspond to a reduction of the edge weights.
In summary, both the delay insertion and skew commitment operations result in reductions of edge weights in the LCG. Moreover, the reductions may reduce the length of the latency path. To ensure that all subtrees can be joined into a clock tree meeting a user-specified latency bound $L_{user}$, we introduce a feasible delay insertion range (FDIR) and a feasible latency range (FLR). If each delay insertion or skew commitment is performed within an FDIR or an FLR, respectively, $-s^p \le L_{user}$ is ensured, and consequently the latency bound is satisfied.
Derivation of FDIRs and FLRs
Consider a delay insertion or a skew commitment operation that reduces the edge weight $w_{ij}$ to $w'_{ij}$. Assume that before the operation, $-s^p \le L_{user}$ is satisfied. By reducing the edge weight $w_{ij}$, the length of the latency path can only be reduced if the edge $e_{ij}$ is part of the latency path. The length of the shortest path from the virtual source to the virtual sink using the edge $e_{ij}$ is $s^t_i + w'_{ij} + s^b_j$. Therefore, the latency bound remains satisfied if:

$$L_{user} + s^t_i + w'_{ij} + s^b_j \ge 0 \qquad (10)$$

The maximum delay insertion $\Delta_i$ at the root of a subtree i is equal to the slack in Eq (10) before the edge weight reduction, i.e., $L_{user} + s^t_i + s^b_j + w_{ij}$. Consequently, the $FDIR_i$ for a delay insertion $\Delta_i$ is formulated as follows:

$$FDIR_i = [0,\; L_{user} + s^t_i + w_{i,vsink}] \qquad (11)$$

where $w_{i,vsink}$ is the weight of the edge from the sink vertex i to the virtual sink before the delay insertion, $s^t_i$ is the top-down shortest path to the sink vertex i, and $s^b_j = 0$ since vertex j corresponds to the virtual sink.
Next, we formulate the $FLR_{ij}$ for a skew commitment $skew_{ij}$. Recall that a skew commitment results in the addition of two edges $e_{ij}$ and $e_{ji}$ with weights $w_{ij} = -skew_{ij}$ and $w_{ji} = skew_{ij}$, respectively. Applying the constraint in Eq (10) to both edges, $FLR_{ij}$ is obtained as follows:

$$FLR_{ij} = [-(L_{user} + s^t_j + s^b_i),\; L_{user} + s^t_i + s^b_j] \qquad (12)$$

In Eq (12), $s^t_i$ and $s^b_j$ (and likewise $s^t_j$ and $s^b_i$) are the shortest paths to the sink vertices i and j with the edges $e_{ij}$ and $e_{ji}$ removed from $G$ and $\bar{G}$. This modification is required because both edge weights $w_{ij}$ and $w_{ji}$ are reduced with the same skew commitment.
Finally, to satisfy both the skew constraints and the latency bound, the concept of a feasible skew-latency range (FSLR) is introduced, which is the intersection of an FSR in Eq (9) and an FLR in Eq (12) as follows:

$$FSLR_{ij} = FSR_{ij} \cap FLR_{ij} \qquad (13)$$
It can be proved that if both FSR and FLR are non-empty, FSLR is also non-empty. In short, the statement holds because the bottom-up shortest path to each sink vertex defines a latency assignment to the sinks that is by definition a part of both the FLR and FSR. Note that the use of FDIRs and FLRs is orthogonal to the use of FSR in [15]. Consequently, if the latency bound is loose, our proposed approach is equivalent to the Greedy-UST/DME algorithm [15].
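The ranges in this section can be sketched as follows. The interval formulas in the code are derived here from the slack expression Luser + s_t[i] + s_b[j] + w_ij stated in the text, so they should be read as an interpretation rather than the authors' exact formulation. The instance (subtree latencies [5, 20, 40], the SCG edge weights, and a latency bound of 70) is hypothetical, and the SCG is assumed to be free of negative cycles.

```python
import math

def shortest_paths(n, edges, src):
    """Bellman-Ford over (tail, head, weight) edges; assumes no negative cycle."""
    dist = [math.inf] * n
    dist[src] = 0
    for _ in range(n - 1):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    return dist

def lcg_paths(latencies, scg):
    """Return (s_t, s_b): top-down and bottom-up shortest-path lengths."""
    n = len(latencies)
    vsource, vsink = n, n + 1
    edges = list(scg)
    for i, lat in enumerate(latencies):
        edges += [(vsource, i, 0), (i, vsink, -lat)]
    s_t = shortest_paths(n + 2, edges, vsource)
    s_b = shortest_paths(n + 2, [(v, u, w) for u, v, w in edges], vsink)
    return s_t, s_b

def fdir(latencies, scg, i, l_user):
    """Feasible delay insertion range at the root of subtree i."""
    s_t, _ = lcg_paths(latencies, scg)
    return (0, l_user + s_t[i] - latencies[i])  # w_{i,vsink} = -latencies[i]

def flr(latencies, scg, i, j, l_user):
    """Feasible latency range for skew_ij, with e_ij and e_ji removed first."""
    pruned = [e for e in scg if {e[0], e[1]} != {i, j}]
    s_t, s_b = lcg_paths(latencies, pruned)
    return (-(l_user + s_t[j] + s_b[i]), l_user + s_t[i] + s_b[j])

def fsr(n, scg, i, j):
    """Feasible skew range [-d_ij, d_ji] from the SCG alone."""
    return (-shortest_paths(n, scg, i)[j], shortest_paths(n, scg, j)[i])

def fslr(latencies, scg, i, j, l_user):
    """Intersection of the FSR and the FLR, or None if it is empty."""
    lo_s, hi_s = fsr(len(latencies), scg, i, j)
    lo_l, hi_l = flr(latencies, scg, i, j, l_user)
    lo, hi = max(lo_s, lo_l), min(hi_s, hi_l)
    return (lo, hi) if lo <= hi else None

latencies = [5, 20, 40]
scg = [(0, 1, -10), (1, 0, 30), (1, 2, -15), (2, 1, 40)]
print(fdir(latencies, scg, 2, 70))     # (0, 5)
print(flr(latencies, scg, 0, 1, 70))   # (-65, 15)
print(fslr(latencies, scg, 0, 1, 70))  # (10, 15)
```

In this instance the minimum achievable latency is 65, so only 5 units of delay can still be inserted above the slowest subtree, and the latency bound trims the upper end of the feasible skew range between sinks 0 and 1 from 30 to 15.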
PROPOSED SYNTHESIS FLOW
In this section, we present the proposed framework for synthesizing latency-bounded clock trees. The framework consists of a latency-aware CTS phase and a CTO phase. The innovations proposed in the paper focus on the latency-aware CTS phase. The CTO phase is an adaptation of algorithms proposed in previous studies.
Latency-aware CTS phase
The latency-aware CTS is based on a traditional bottom-up tree construction flow that constructs a clock tree buffer stage by buffer stage, alternating between merging subtrees [15] and inserting buffers [4, 2, 8]. This "core flow" is illustrated with solid boxes in Figure 4 and consists of "merging" and "buffer insertion".
Merging: The process is based on the Greedy-UST/DME algorithm [15] using a nearest neighbour graph (NNG). In an NNG, each subtree is represented with a vertex and each edge represents the wiring cost of merging two subtrees. In each iteration, an attempt is made to merge the two subtrees connected by the least-cost edge within an FSR. If the merge can be performed within an FSR while meeting a transition time constraint, the two subtrees are replaced by a larger subtree, which is reinserted into the NNG. Otherwise, the two subtrees are locked from further merging. After all subtrees are locked from further merging, a buffer is inserted at the root of each of the subtrees in the buffer insertion step.
Buffer insertion: A minimally sized buffer that can drive each respective locked subtree while meeting the transition time constraint is inserted at the root of the respective subtree, as in [4, 2, 8] .
Assume that a set of subtrees has been constructed from the "merging" and "buffer insertion" processes. In addition, assume that the roots of these subtrees are located at the same spatial location. In such a situation, the LCG can be utilized to merge the subtrees into a clock tree with a latency of $-s^p$, in which the latency to each of the sequential elements is equal to (the negative of) the bottom-up shortest path to each respective sink vertex in the LCG. Consider the subtrees in Figure 1(a) with the LCG in Figure 3. One way to construct such a clock tree would be to insert the delay difference between the final latencies [65, 55, 40] and the current latencies [5, 20, 40] at the roots of the respective subtrees before joining them (see Figure 1(b)). This method of joining the subtrees, based on the LCG, is referred to as root construction; it is labeled as "root construction" in Figure 4 and is further explained in Section 4.1.1.
However, to utilize the root construction, all subtrees are required to be located at the same spatial location. To satisfy this requirement, a root container is introduced. (In our implementation, the root container is located at the center of the circuit.) After the "buffer insertion" step in the core flow, it is checked, for every subtree, whether the subtree can reach the root container by inserting a stem wire below the newly inserted buffer. (A stem wire is a wire that is connected between the root of a subtree and the respective driving buffer [4] ). If so, the subtree is routed to and placed in the root container and removed from the core flow. Otherwise, the subtree continues to be part of the iterative merging and buffer insertion process. After all subtrees in the core flow are located in the root container, the proposed root construction is applied.
Next, we propose to supplement the core flow with a latency-aware subtree construction technique. This technique overcomes the limitation that $-s^p \le L_{user}$ may not be satisfied at the beginning of the root construction. To ensure that $-s^p \le L_{user}$ holds at the beginning of the root construction, we propose to incorporate techniques called "Merging based on FSLRs", "Latency locking based on FDIRs", and "Subtree dragging based on FDIRs", which are illustrated in Figure 4 and explained in Section 4.1.2.
Root construction
The input to the root construction is a set of subtrees located at the same spatial location (in a root container). The output is a clock tree with the latency to each sequential element equal to (the negative value of) the bottom-up shortest path to each sink vertex. We explain the root construction in Figure 5, by illustrating how the subtrees in Figure 1(a), with the LCG in Figure 3, are joined to form a clock tree in Figure 1(b). The required delay insertions are the differences between the final latencies [65, 55, 40] and the current latencies [5, 20, 40], i.e., ∆ = [60, 35, 0]. A naive solution would be to realize each of the delay insertions separately, i.e., a delay ∆1 = 60 and ∆2 = 35 could be added to the subtrees 1 and 2, respectively, before joining the three subtrees at the root.
Instead, we propose a method of maximal sharing of the required delay insertions. The delay insertions ∆ are sorted in decreasing order, i.e., [60, 35, 0]. This sorted order defines both the merging order and the sharing of the delay insertions. The subtrees are merged from left to right corresponding to the sorted order of the delay insertions, i.e., subtree 1 is merged with subtree 2 and the resulting subtree is merged with subtree 3. Each delay insertion is shared with all subtrees located to the left in the sorted delay insertion array, i.e., the delay 35 is shared among both subtrees 1 and 2. By using the maximal sharing of the delay insertions, we expect the cost to be reduced compared with the case when root construction is not applied. Moreover, yield may also improve since the sinks will be closer in the tree topology.
Furthermore, the delay insertions ∆ are realized imperfectly to simplify the root construction process. We restrict the delay realization to be performed by buffer insertion. Therefore, the delay insertion granularity is limited by the buffer library. Here, instead of realizing each delay insertion precisely, we attempt to realize each delay insertion as close to, but less than or equal to, the specified delay insertion as possible. The motivation for realizing delays imperfectly is that the CTO process can realize more precise delay adjustments after CTS is complete. Since the timing and the topology of the initial clock tree are available during CTO, delay adjustments can be realized with a finer granularity while considering more accurate OCV estimates.
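The maximal sharing of delay insertions can be sketched as below. This is an illustrative reading of the description above, not the authors' implementation: the function names are hypothetical, and the uniform 10-unit granularity in `round_down` is a stand-in assumption for a real buffer library.

```python
def shared_delay_plan(delays):
    """Plan the root construction for {subtree: required delay insertion}.

    Returns a list of (subtree, segment) pairs in merging order. At each
    step the named subtree is merged into the running tree, and `segment`
    is the delay then inserted above the root of the merged tree.
    """
    order = sorted(delays, key=delays.get, reverse=True)
    plan = []
    for k, name in enumerate(order):
        next_delay = delays[order[k + 1]] if k + 1 < len(order) else 0
        plan.append((name, delays[name] - next_delay))
    return plan

def round_down(delay, step):
    """Realize a delay imperfectly: closest multiple of step not above it."""
    return (delay // step) * step

# Delay insertions of the running example: final minus current latencies.
plan = shared_delay_plan({1: 60, 2: 35, 3: 0})
print(plan)                                 # [(1, 25), (2, 35), (3, 0)]
print(sum(seg for _, seg in plan))          # 60
print([(s, round_down(d, 10)) for s, d in plan])
```

Subtree 1 sees 25 + 35 = 60 units of delay and subtree 2 sees the shared 35, so the total inserted delay is 60 instead of the 95 required when the two insertions are realized separately.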
It is expected that the imperfect root construction may affect the yield after CTS. If inadequate safety margins are provided in a single skew constraint, the constructed clock tree may suffer substantial yield loss. However, after CTO, the yield should be recovered.
Latency-aware subtree construction
The root construction step is capable of joining the subtrees to a clock tree with a latency of $-s^p$. The techniques in this section are designed to ensure that $-s^p \le L_{user}$ is satisfied at the beginning of the root construction.
It is straightforward to only allow subtrees to be merged within FSLRs introduced in Eq (13), ensuring that $-s^p \le L_{user}$ is satisfied after two subtrees are merged. However, with only that extension of the core construction flow, subtrees with long latencies that are spatially distant from the root container may be created. Moreover, when these subtrees are routed to the root container, the constraint $-s^p \le L_{user}$ may be violated.
To account for the spatial location of a subtree, the concept of virtual subtree latency is introduced. The virtual subtree latency of a subtree is equal to the delay that is inserted if the subtree is routed (or dragged) to the root container by connecting a wire from the root of the subtree to the location of the root container (with buffers inserted on the wire to control the transition time). Next, the latency edge weights are modified to be the negative value of the sum of the sink latency and the virtual subtree latency.
With the inclusion of virtual latencies, a clock tree with a latency of at most $L_{user}$ can be constructed by dragging each subtree to the root container and then joining them at the root container. When dragging the subtrees to the location of the root container, the virtual latency would be replaced with "real" latency and the weight of each latency edge would remain the same. Consequently, the length of the latency path $s^p$ would also remain the same, and a clock tree with a latency of $-s^p \le L_{user}$ can be constructed. However, routing each of the subtrees (or sinks) to the location of the root container directly would be very costly in terms of wire and buffer resources. Therefore, we do not drag subtrees to the root container directly but instead continue to merge subtrees while ensuring that $-s^p \le L_{user}$. Only after a subtree has grown beyond a threshold is it dragged to the root container. The core flow is modified as follows:
Merging based on FSLRs: The "Merging" in the core flow is modified to merge subtrees within FSLRs in Eq (13) instead of FSRs. By merging subtrees within FSLRs it is ensured that both skew constraints and the latency bound are satisfied.
Subtree locking based on FDIRs: If a subtree has long latency or is distant from the root container, it may be costly to merge it with another subtree. Therefore, we lock such subtrees from further merging and drag them towards the root container in the subtree dragging step. A subtree is locked if the upper bound of its FDIR (see Eq (11)) is less than a parameter $p_{lock}$ = 20 ps. This condition is checked in the merging process after each new subtree is formed.
Subtree dragging based on FDIRs: The subtree dragging is applied after the buffer insertion step. It is applied to every subtree whose FDIR upper bound (see Eq (11)) is less than a parameter $p_{drag}$ = 20 ps. The driving buffer is sized up to the next driver size and a piece of stem wire is inserted between the buffer and the subtree. The stem wire is elongated to the maximum length while satisfying a transition time constraint. Next, the buffer is dragged as close to the root container as possible using the stem wire, transferring virtual latency to real latency.
Clock tree optimization (CTO) phase
The CTO phase is based on the techniques proposed in [9] . The optimization aims to remove timing violations (or negative slacks, i.e., TNS and WNS) in a clock tree by realizing non-negative delay adjustments. The location and the magnitude of the delay adjustments are determined by solving an LP formulation. Next, the delay adjustments are realized by inserting buffers and detouring wires [10] .
Note that delay insertions that are realized imperfectly during the root construction can be realized as delay adjustments during the CTO process, if required. Moreover, it is important to observe that if the latency of the initial clock tree is long, large delay variations are present in many skew constraints and there may be little room for CTO to improve performance. Therefore, we say that by bounding the latency of a clock tree during latency-aware CTS, the clock tree is more amenable to CTO.
EXPERIMENTAL EVALUATION
The proposed algorithms are implemented in C++ and the experiments are performed on a quad core 3.10 GHz Linux machine with 7.7 GB of memory.
To evaluate our proposed framework, the extension [8] of the problem formulation in the ISPD 2010 contest [14] is used. A summary of the properties of the synthesized circuits that are used in the evaluation is shown in Table 1 [7]. In [8], a Monte Carlo framework is used to evaluate the robustness of a clock tree subject to voltage (15%), wire width (10%), temperature (30%), and channel length variations (10%). The variations are generated using a quad-tree model [1] to exhibit spatial correlation. The robustness of a clock tree is measured by simulating it with 500 Monte Carlo simulations. In each simulation, it is checked whether all the skew constraints (see Eq (5) and Eq (6)) and a transition time constraint are satisfied. The transition time constraint states that the 10% to 90% rise or fall time of the clock signal at any point in the clock tree must be below 100 ps. The quality of a clock tree is measured in terms of yield. The yield is defined to be the number of simulations with no constraint violations divided by the total number of simulations. The power consumption is estimated with capacitive cost.
To clearly demonstrate the impact of each optimization step we show the performance and cost of the constructed clock trees after both CTS and CTO. Even though the evaluation is performed in terms of yield, we show the TNS and WNS estimates that are used to guide the CTO process. It is expected that if the TNS and WNS are zero, or small, the yield of the clock tree will be high when evaluated by the Monte Carlo framework. However, there is no guarantee that a clock tree with zero TNS and WNS will have 100% yield or that a clock tree with non-zero TNS and WNS will suffer yield loss. The parameter cocv = 0.085 is used based on statistics obtained through circuit simulations.
We construct and evaluate several different tree structures to show the impact of our proposed optimization techniques. The structure "Tree" is a clock tree constructed using only the traditional core flow, i.e., using merging and buffer insertion. The structure "R-Tree" is the core flow with the addition of the root construction. The structure "L-R-Tree" is the structure obtained by the complete flow, i.e., the core flow with the latency-aware subtree construction techniques and the root construction. After the different tree constructions have been performed, CTO is performed. We also report the results of the most competitive clock trees constructed in [8]. The construction process of clock trees in [8] is similar to that of our "Tree" structure. However, CTO is not performed in [8]. To enable a fairer comparison, we perform CTO on the "Tree" structures in [8]. Table 2 shows the experimental results on the circuits scaled s15850, ecg, and aes. We construct the "Tree" structures and "R-Tree" structures with different safety margins Muser and compare the performance after CTS and after CTO. We have also included relevant results from [8].
Evaluation of root construction
We observe the same trends as in [15, 8] for the "Tree" structures constructed with different safety margins Muser after CTS. The performance in TNS, WNS, and yield is improved when larger safety margins are inserted in the skew constraints. For example, on ecg, a yield of 92.8% (97.4% and 99.6%) is obtained when a safety margin of 20 ps (30 ps and 40 ps) is inserted. However, as larger safety margins are inserted, the construction process becomes more constrained and it can be observed that the capacitive cost and latency increase from 32.4 pF and 344 ps to 62.4 pF and 850 ps, respectively. Next, to demonstrate the importance of considering latency during CTS, we discuss the difference in performance between the "Tree" and the "R-Tree" structures. Both structures are constructed based on a user-specified safety margin Muser. However, the "R-Tree" structures also try to minimize latency (although no latency bound is required).
After CTS, it can be observed that the "R-Tree" structures have 15% shorter latency and 7% lower capacitive cost on average compared with the "Tree" structures. The latency improvements are a consequence of using the LCG to join the subtrees in the root container into a single clock tree. Moreover, the reductions in capacitive cost likely stem mostly from the sharing of delay insertions during root construction, and from the imperfect root construction.
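As an illustration of the shortest-path computation that underlies the use of the LCG, the following is a minimal sketch of the Bellman-Ford algorithm, one standard O(VE) method that tolerates negative edge weights. Whether the framework uses this particular algorithm is an assumption on our part. The function name `bellman_ford` and the toy graph are hypothetical and do not reproduce the actual LCG construction rules; only the final step, taking the negative of the shortest-path length as the minimum latency, follows the definition of the LCG.

```python
import math


def bellman_ford(num_nodes, edges, source):
    """Return shortest-path distances from `source` in O(V * E).

    `edges` is a list of (u, v, weight) tuples; weights may be negative.
    Raises ValueError if a negative cycle is reachable from the source.
    """
    dist = [math.inf] * num_nodes
    dist[source] = 0
    # At most V - 1 relaxation rounds are needed for simple paths.
    for _ in range(num_nodes - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            break  # early exit: distances have converged
    # One extra pass: any further improvement implies a negative cycle.
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("negative cycle: constraints are inconsistent")
    return dist


# Hypothetical toy graph (weights in ps), NOT an LCG from the paper.
toy_edges = [(0, 1, -100), (0, 2, -120), (1, 3, -30), (2, 3, 0)]
toy_dist = bellman_ford(4, toy_edges, source=0)
min_latency_ps = -toy_dist[3]  # negative of the shortest-path length
print(min_latency_ps)
```

In this toy graph the shortest path to node 3 passes through node 1 and has length -130, so the sketch reports a minimum latency of 130 ps.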
The "R-Tree" structures also outperform the "Tree" structures in terms of TNS and WNS. This is expected because the delay variations introduced by OCV are smaller in clock trees with shorter latencies. However, on a few circuits, the "R-Tree" structures obtain worse TNS and WNS; we believe this is because of the imperfect delay insertion during root construction. For example, on circuit aes, with Muser = 60 ps, the "R-Tree" structure has worse TNS and WNS.
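For reference, TNS and WNS can be computed from a list of per-constraint slacks as sketched below. The sketch assumes the common convention that a negative slack denotes a violated constraint and that both metrics are zero when no constraint is violated. The function name `tns_wns` and the sample slacks are illustrative and are not taken from Table 2.

```python
def tns_wns(slacks_ps):
    """Return (TNS, WNS) in ps for a list of skew-constraint slacks.

    TNS is the sum of all negative slacks; WNS is the most negative slack.
    Both are 0.0 when every constraint is met.
    """
    negative = [s for s in slacks_ps if s < 0]
    tns = sum(negative, 0.0)
    wns = min(negative, default=0.0)
    return tns, wns


# Illustrative slacks (ps): two constraints are violated.
print(tns_wns([5.0, -2.0, -3.5, 0.0]))
```

A tree can have a small TNS and still fail timing, because WNS captures the single worst constraint while TNS aggregates all violations.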
Even though the "Tree" structures have worse performance in TNS and WNS, their yield in the Monte Carlo evaluation is expected to be better than that of the "R-Tree" structures. This is because yield loss is suffered if even a single skew constraint is violated as a result of the imperfect root construction for the "R-Tree" structures. (However, we expect this yield loss to be recovered after CTO.) Next, we focus our attention on the experimental results obtained after CTO. For the "Tree" structures, it can be observed that on many circuits, CTO is unable to improve the performance in TNS and WNS. As mentioned, this may be because the "Tree" structures have long latencies and there is no room for further optimization. On circuits where CTO is unable to improve or close timing in terms of TNS and WNS, the yield may be worse after CTO because the safety margins in the skew constraints may have been redistributed unevenly, and an inadequate safety margin in a single skew constraint may result in substantial yield loss.
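The all-or-nothing nature of yield can be made concrete with a minimal Monte Carlo sketch: a sample counts as passing only if every skew constraint holds. The variation model below (an independent Gaussian perturbation of each sink arrival time) is a deliberate simplification and an assumption; it is not the OCV model used in our evaluation. The function name `monte_carlo_yield`, its parameters, and the sample data are hypothetical.

```python
import random


def monte_carlo_yield(nominal_arrival_ps, constraints, sigma_ps, trials, seed=0):
    """Estimate yield as the fraction of samples meeting ALL skew constraints.

    nominal_arrival_ps: dict mapping sink name -> nominal arrival time (ps).
    constraints: list of (i, j, lo, hi) requiring lo <= t_i - t_j <= hi.
    sigma_ps: std. dev. of the assumed independent Gaussian perturbation.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    passed = 0
    for _ in range(trials):
        arrival = {
            sink: t + rng.gauss(0.0, sigma_ps)
            for sink, t in nominal_arrival_ps.items()
        }
        # A single violated constraint fails the whole sample.
        if all(lo <= arrival[i] - arrival[j] <= hi for i, j, lo, hi in constraints):
            passed += 1
    return passed / trials


# Hypothetical two-sink example with one skew constraint.
nominal = {"a": 0.0, "b": 10.0}
window = [("a", "b", -20.0, 20.0)]
print(monte_carlo_yield(nominal, window, sigma_ps=5.0, trials=1000))
```

Because a sample is rejected as soon as one constraint fails, a tree whose aggregate metrics look good can still lose yield through one constraint with an inadequate margin.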
On the other hand, the CTO process is capable of removing or reducing TNS and WNS when applied to the "R-Tree" structures. Therefore, we say that our framework constructs clock trees that are more amenable to CTO. Moreover, the capacitive overhead of performing CTO is small on average; only a 2.4% increase in capacitive cost is observed. Since significant reductions in TNS and WNS are obtained, it is not surprising that the yield of the "R-Tree" structures is higher after CTO than after CTS. We also note that the "R-Tree" structures have higher yield after CTO compared with the "Tree" structures after both CTS and CTO. This confirms the importance of considering both latency and safety margins during CTS. Nevertheless, if the initial clock tree is constructed with too small a safety margin Muser, CTO may not be able to achieve timing closure, and the yield may be inadequate (see Muser = 20 ps and 20 ps for scaled s15850 and ecg, respectively).
Evaluation of latency-bounded clock trees
We are interested in improving yield on the circuits ecg, aes, and scaled s15850. As observed in Table 2, the "R-Tree" structures performed better than the "Tree" structures mainly because of their shorter latencies. Therefore, we speculate that we may be able to improve performance by combining the use of safety margins and latency bounds. Moreover, by imposing a latency bound, a smaller safety margin Muser can potentially be used, which may translate into savings in capacitive cost. In Table 3, we construct three versions of the "L-R-Tree" structure on each circuit, each with the same safety margin Muser but a different latency bound Luser. The safety margin Muser is set to 25 ps, 30 ps, and 50 ps because these safety margins seem to provide a reasonable starting point in yield and capacitive cost (see the relevant results for "R-Tree" in Table 2, which are repeated in Table 3).
In Table 3 , we observe that the framework is capable of constructing clock trees meeting specified latency bounds. For example, on ecg and aes, the "R-Tree" structures have latencies of 382 ps and 2207 ps, respectively; the "L-R-Tree" structures have latency ranges of 265 ps to 318 ps and 1483 ps to 1863 ps, respectively. Moreover, it is observed that it is not too costly to reduce the latency quite substantially. The capacitive overheads are at most 15%.
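To quantify the latency reductions quoted above, the short sketch below computes the relative reduction of the "L-R-Tree" latency range with respect to the "R-Tree" latency on ecg and aes. The latency values are the ones stated in the text; the helper name `reduction_percent` is ours and is only illustrative.

```python
def reduction_percent(baseline_ps, bounded_ps):
    """Relative latency reduction (in %) with respect to the baseline."""
    return 100.0 * (baseline_ps - bounded_ps) / baseline_ps


# (circuit, R-Tree latency, shortest L-R-Tree latency, longest L-R-Tree latency)
latencies_ps = [("ecg", 382, 265, 318), ("aes", 2207, 1483, 1863)]

for circuit, r_tree, lr_min, lr_max in latencies_ps:
    smallest = reduction_percent(r_tree, lr_max)  # loosest latency bound
    largest = reduction_percent(r_tree, lr_min)   # tightest latency bound
    print(f"{circuit}: {smallest:.1f}% to {largest:.1f}% shorter latency")
```

This works out to latencies that are roughly 16.8% to 30.6% shorter on ecg and 15.6% to 32.8% shorter on aes.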
We observe that the performance in TNS and WNS after CTS is inconclusive. On some circuits and for certain latency bounds, the "R-Tree" structures perform better, and for others, the "L-R-Tree" structures perform better. This may be because of the imperfect root construction that is applied in the construction of both the "R-Tree" and the "L-R-Tree" structures. However, after CTO, the "L-R-Tree" structures seem to obtain better performance in TNS and WNS if Luser is set to an appropriate value.
Moreover, the yield of the "L-R-Tree" structures is better compared with the "R-Tree" structures. On scaled s15850 and ecg, the yield is better after CTO compared with after CTS, as expected. However, on aes, it is observed that the yield is actually better after CTS compared with CTO, even though the performance is better in TNS and WNS after CTO. We believe that this may be because the performance in TNS and WNS does not correlate perfectly with yield performance. However, we still believe that TNS and WNS are useful metrics for optimization.
Based on the results in Table 2 and Table 3, it seems essential to combine the use of safety margins and latency bounds to be able to construct clock trees with high yield. Only the "L-R-Tree" is capable of obtaining a yield of 100% (after CTS or CTO) on all the circuits. Moreover, the "L-R-Tree" structures are cheaper in terms of capacitance than the clock trees constructed with larger safety margins Muser in Table 2.
SUMMARY AND FUTURE WORK
We have introduced the concept of an LCG that captures skew constraints, latencies, and skews committed in a synthesis process. Based on the LCG, we have proposed a framework that constructs latency-bounded clock trees given a user-specified latency bound Luser and a user-specified uniform safety margin Muser. In the future, we will solve component (1) of the problem defined in Section 2.3, i.e., determining appropriate Muser and Luser, and the problem as a whole.
