The clock trees of high-performance synchronous circuits have many clock logic cells (e.g., clock gating cells, multiplexers and dividers) in order to achieve aggressive clock gating and required performance across a wide range of operating modes and conditions. As a result, clock tree structures have become very complex and difficult to optimize with automatic clock tree synthesis (CTS) tools. In advanced process nodes, CTS becomes even more challenging due to on-chip variation (OCV) effects. In this paper, we present a new CTS methodology that optimizes clock logic cell placements and buffer insertions in the top level of a clock tree. We formulate the top-level clock tree optimization problem as a linear program that minimizes a weighted sum of timing slacks, clock uncertainty and wirelength. Experimental results in a commercial 28nm FDSOI technology show that our method can improve post-CTS worst negative slack across all modes/corners by up to 320ps compared to a leading commercial provider's CTS flow.
INTRODUCTION
In a modern SOC, clock logic cells (CLCs), such as clock gating cells (CGCs), multiplexers (MUXes) and dividers (DIVs), are required in the clock tree to achieve different performance and power saving requirements. To enable multi-mode operation and dynamic voltage frequency scaling (DVFS), large numbers of clocks are generated to drive flip-flops (FFs) in an SOC (see footnote 1). Balancing the clock trees of multiple clocks is challenging because timing constraints depend on clock periods, and on the process, voltage and temperature (PVT) corners. Furthermore, as on-chip variation (OCV) increases, clock uncertainties (derates) on the launch and capture paths can increase. Clock tree synthesis (CTS) must find optimal branching points in the clock tree to minimize clock uncertainties due to OCV on non-common paths [9] [16] [17]. Figure 1 (left) illustrates the clock balancing problem due to CLCs in a clock tree and the impact due to OCV. Due to the CLCs, the clock arrival times at FF groups are skewed. Moreover, the clock tree splits near the clock source; this leads to long non-common paths between the FF groups. As shown in Figure 1 (right), we can insert buffers to balance the clock, and optimize placement of the CLCs to reduce the non-common paths.
Motivation for Clock Tree Optimization
Given a clock tree, we represent the top-level clock tree as a hypergraph, G_top(V_top, E_top), in which V_top is the set of CLCs and the transitive fanin cells of the CLCs, and E_top is the set of nets that connect the cells in V_top. Figure 2 shows a top-level clock tree with a CLC and three bottom-level buffered clock trees. In most cases, sophisticated EDA tools and CTS algorithms are able to achieve good solutions for the bottom-level clock trees. However, achieving a good solution for the top-level clock tree can be problematic when there are critical paths across the FF groups between different bottom-level clock trees. The requirements to balance the top-level clock are not obvious due to the complex structure of the tree (see Figure 6). Fixing the critical paths across the FF groups can be difficult at the bottom-level clock trees due to tight timing constraints among FFs within the same group. To optimize timing across FF groups, we propose to balance the top-level clock tree while preserving the bottom-level clock trees. For example, in Figure 2, if we increase the delay d(1, 2) on the net between pins 1 and 2 from 2ns to 4ns, we can change the skew between FF groups 1 and 2 from 2ns to 0ns, thereby meeting the timing target of critical path A, which has a clock period of 3ns. Note that varying the delay on the top-level clock tree does not affect critical path B, which has both its launch and capture FFs in the same group (although the OCV derating on a longer top-level path will be larger). Therefore, we only need to consider the requirements to balance the clock across FF groups, thereby simplifying the top-level clock tree optimization problem.
Since problems arise in the top-level tree due to CLCs, our work focuses on optimizing the placement of CLCs and insertion of buffers in the top-level clock tree.
Footnote 1: Both synchronous and asynchronous clocks can exist in an SOC. Our work focuses on balancing synchronous clocks in an SOC.
Previous Work
Rajaram and Pan [16] propose CTS algorithms to optimize the chip-level clock tree across different PVT corners. They use quadratic programming to reallocate clock pins of IP blocks to reduce non-common paths in the chip-level clock tree. After clock pins are reallocated, buffers are inserted up to each pin, and subtrees are merged recursively in the same manner as the deferred-merge embedding (DME) algorithm [6] . The algorithm only inserts buffers that minimize the difference in clock latency among subtrees across PVT corners. Although the chip-level CTS work in [16] accounts for delay variation across PVT corners and timing penalty on non-common paths, it does not consider CLCs, timing between FF groups, or wirelength, all of which make CTS a challenging task. As illustrated in Figure 1 , the placement of CLCs should also be considered during CTS as it can significantly affect the non-common paths in the tree. Other works [20] [18] seek to minimize the effect of OCV during CTS, but do not address the issues of CTS with CLCs across multi-corner and multi-mode (MCMM) scenarios. Lung et al. [12] propose a linear programming (LP) based clock skew optimization [8] which accounts for delay variation across PVT corners. They also present a method to map the required delays obtained from the LP to actual circuits. While mapping delays, they use updated timing information to dynamically adjust buffer delays. Although this work addresses the MCMM clock skew minimization problem, it does not consider the effects of non-common paths and CLC placement. There are many previous works on buffer insertion for CTS (e.g., [1] [4]), but they do not consider clock trees with CLCs which have different timing requirements depending on the operating modes and FF groups. Papa et al. [15] 
Our Work
To address the top-level CTS problem mentioned above, we propose a new CTS flow that accounts for the effects of CLCs as well as delay variations due to MCMM and OCV. The basic idea of our approach is to automatically identify the requirements to balance clocks based on the timing-critical paths and use them to drive the CTS. The flow shown in Figure 3 starts with a placed design and performs conventional CTS to obtain a clock tree. We then extract the top-level clock tree (see Algorithm 1) and remove buffers in the top-level clock tree. Within the remaining (bottom-level) clock trees, we extract timing-critical FF-to-FF paths to identify the timing requirements for clock balancing. Based on these requirements, we construct a linear program (LP) to optimize the placement of CLCs and the delay on nets (achieved by inserting buffers) in the top-level clock tree. Unlike the routing algorithm proposed by Oh et al. [13], which minimizes the total wirelength of a routing tree, we include CLCs and Steiner point locations as variables in the LP, so that the LP-based optimization can account for the cost of non-common paths. With the physical locations of CLCs and Steiner points of the routes, we insert buffers in the top-level clock tree, legalize the placement and route the clock tree. The advantages of our methodology are as follows.
• Preserving the bottom-level clock trees affords more accurate timing information for the top-level clock tree optimization.
• Since the top-level clock tree has many fewer instances, we can perform runtime-intensive optimizations which cannot be practically applied to the bottom-level clock tree.
• Introducing our new top-level clock tree placement optimization enables fixing of suboptimal CLC placements which have already been determined during the preceding placement stage.
• Buffer insertion and CLC placement optimization can achieve reductions of non-common path timing penalties, which are not achievable using local/incremental optimizations.
The key contributions of our work are summarized as follows.
• We propose a new automated clock tree synthesis methodology that optimizes the CLC placements and buffer insertion in the top-level clock tree.
• We propose an LP-based clock tree optimization method which accounts for routing resources (i.e., wirelength), circuit timing and the impact of non-common paths.
• Our method improves WNS by up to 320ps, and reduces the top-level clock wirelength by up to 50% compared to a default CTS flow.
• As part of our validation process, we develop generators for testcases that represent clock tree structures typically found in high-speed IPs (e.g., graphics accelerators) and real-world SOCs.
In the remainder of this paper, Section II describes our top-level clock tree optimization methodology. Section III describes our experimental setup and experimental results. In Section IV, we summarize our work and outline directions for future research.
CLOCK TREE OPTIMIZATION
We now explain the top-level clock tree optimization problem and our approach. In the following, we use condition k to denote that a timing value is specific to a PVT corner, operating mode, clock group and timing analysis type (setup or hold). For example, with two PVT corners, two operating modes, two clock groups and two timing analysis types, k ranges over 1, 2, ..., 16.
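As a small illustration of how the condition index k can be read, the sketch below enumerates every combination of corner, mode, clock group and analysis type. The corner, mode and group names are hypothetical placeholders, and the ordering of the index is an assumption made only for this example.

```python
from itertools import product

# Hypothetical names; the paper only states how many of each there are.
corners = ["corner_A", "corner_B"]
modes = ["mode_1", "mode_2"]
clock_groups = ["group_1", "group_2"]
analysis_types = ["setup", "hold"]

# Map each condition index k (starting at 1) to one combination.
conditions = {
    k: combo
    for k, combo in enumerate(
        product(corners, modes, clock_groups, analysis_types), start=1
    )
}

print(len(conditions))   # number of distinct conditions
print(conditions[1])     # first combination
```

With two choices in each of the four dimensions, the enumeration yields the 16 conditions mentioned in the text.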
Problem Statement
Formally, the top-level CTS problem is defined as follows.
Objective: Minimize the weighted sum of (i) worst negative slack (WNS), (ii) total negative slack (TNS), (iii) clock uncertainty and (iv) wirelength of a clock tree [16].
Input: Placed design; list of CLCs; timing constraints (.sdc).
Output: An optimized placement of CLCs and clock buffers, and clock routing of the top-level clock tree.
We model the cost of clock uncertainty Z_k(a, b) on a critical path between FFs a and b as the sum of delays of the non-common launch and capture clock paths in the critical path. The non-common path delays are normalized to the clock period (CP) of the path using factor α_k:
\[ Z_k(a, b) = \alpha_k \cdot \Big( \sum_{e(i,j) \in h_a \setminus h_b} d_k(i, j) + \sum_{e(i,j) \in h_b \setminus h_a} d_k(i, j) \Big) \]
where h_a denotes a launch/capture path from a clock source to FF a, and d_k(i, j) is the delay between pins i and j.
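The cost model described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes each clock path is given as a list of pin-to-pin edges starting at the clock source, that the common portion is the shared prefix of the two paths, and that α_k = 1/CP. The pin names and delay values are made up for the example.

```python
def non_common_cost(path_a, path_b, delay, alpha):
    """Normalized sum of the delays on the non-common parts of two clock paths.

    path_a, path_b: lists of (i, j) edges from the clock source to each FF.
    delay: dict mapping an edge to its delay at the condition of interest.
    alpha: normalization factor (assumed here to be 1 / clock period).
    """
    # Count the edges shared by both paths, starting at the clock source.
    common = 0
    for edge_a, edge_b in zip(path_a, path_b):
        if edge_a != edge_b:
            break
        common += 1
    launch = sum(delay[e] for e in path_a[common:])
    capture = sum(delay[e] for e in path_b[common:])
    return alpha * (launch + capture)


# Hypothetical example: both paths share the edge from the source to pin "m".
h_a = [("src", "m"), ("m", "a1"), ("a1", "ff_a")]
h_b = [("src", "m"), ("m", "b1"), ("b1", "ff_b")]
d = {
    ("src", "m"): 1.0,
    ("m", "a1"): 0.5,
    ("a1", "ff_a"): 0.25,
    ("m", "b1"): 0.75,
    ("b1", "ff_b"): 0.5,
}
clock_period_ns = 4.0
z = non_common_cost(h_a, h_b, d, alpha=1.0 / clock_period_ns)
print(z)
```

Moving the branching point closer to the FFs lengthens the shared prefix and therefore lowers this cost, which is the effect the placement optimization aims for.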
Our Approach
We formulate the top-level clock tree balancing problem as a linear program by assuming that we can vary (i) the delay d_ref(i, j) from an output pin i to its fanout input pin j at a reference condition; (ii) the locations of CLCs; and (iii) the Steiner points in the clock net (for a given topology). Although wire delay is normally nonlinear with respect to wirelength, we approximate d_ref(i, j) as a linear function of the distance between pins i and j, assuming buffer insertion (as noted in, e.g., [15], the delay of a net with uniformly spaced buffers is linearly proportional to the number of stages). The main objective of the LP is to minimize the weighted sum of the worst negative slack S_wns, the total negative slack S_tns, the non-common path costs Z_k(a, b), and the total wirelength U(i, j). Note that we weight each Z_k(a, b) in proportion to its original negative slack (i.e., 1 − s^0_k(a, b)) so that the LP focuses on reducing the non-common path delay on timing-critical paths. The critical paths and their original slacks s^0_k(a, b) are extracted after the buffer removal step in Figure 3 by performing static timing analysis (STA).
To represent negative slack s_k(a, b) in the LP, we use Constraints (3) and (4), which force s_k(a, b) = 0 whenever the path slack is positive. S_wns and S_tns are defined in Constraints (5) and (6), respectively. Since circuit designers may treat hold and setup slacks differently, we use a weight γ_k ≥ 0 to set the relative importance (i.e., normalization ratio) of setup and hold slacks. The value of γ_k can differ between hold and setup analysis, as indicated by the condition k. We represent the timing slack s_k(a, b) of each timing-critical path between FFs a and b as a function of the original slack, the original clock skew λ_k(a, b), and the clock arrival times (t_ref(a)) in Constraint (7). Because delay and slack vary according to PVT corner and timing analysis type, we normalize the slacks across different conditions to a reference corner by using scaling factors η_k, following the approach in [12]. ζ = 1 if the path is a setup-critical path and ζ = −1 if the path is a hold-critical path. t_ref(a) is the sum of delays along the path h_a (Constraint (8)).
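Objective:
\[ \min \; -w_{wns} \cdot S_{wns} - w_{tns} \cdot S_{tns} + w_{wl} \cdot \sum_{e(i,j) \in E_{top}} U(i, j) + w_{ncp} \cdot \sum_{k} \sum_{(a,b)} \big(1 - s^0_k(a, b)\big) \cdot Z_k(a, b) + w_{dis} \cdot \sum_{i} M(i, i_0) \]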
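To make the roles of the clamped slacks, S_wns and S_tns more concrete, the following sketch evaluates them for a fixed list of path slacks. It is an illustration only: inside the LP these quantities are variables bounded by linear constraints, and the way γ_k scales each slack here (one weight per analysis type, applied before taking the minimum and the sum) is an assumption, as are the example values.

```python
def summarize_slacks(paths, gamma):
    """Return (wns, tns) computed from weighted, clamped slacks.

    paths: list of (analysis_type, slack_ns) tuples.
    gamma: dict mapping analysis type to its weight (assumed form).
    """
    # Only negative slack counts; positive slack is clamped to zero.
    clamped = [gamma[kind] * min(0.0, slack) for kind, slack in paths]
    wns = min(clamped, default=0.0)   # worst (most negative) weighted slack
    tns = sum(clamped)                # total weighted negative slack
    return wns, tns


# Hypothetical slacks: two setup paths and one hold path.
example_paths = [("setup", -0.25), ("setup", 0.5), ("hold", -0.125)]
example_gamma = {"setup": 5.0, "hold": 1.0}
wns, tns = summarize_slacks(example_paths, example_gamma)
print(wns, tns)
```

Raising the setup weight makes a setup violation dominate both sums, which mirrors the prioritization of setup slacks discussed in the text.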
The values of λ_k(a, b) and the cell delays in d_ref(i, j) are constants in the LP, and are extracted from STA reports after the buffer removal step in our flow. In Constraint (9), we model the delay d_ref(i, j) between pins i and j as a linear function of the Manhattan distance U(i, j) between the pins. β_ref is a conversion factor to convert the Manhattan distance to delay at the reference condition. We obtain the value of β_ref using the optimal repeater length method in [2]. The value of β_ref is 30ps per 100µm for an 8X buffer in the 28nm foundry FDSOI standard cell library that we use in our experiments. We calculate Z_k(a, b) in Constraint (10). The Manhattan distances are calculated by using Constraints (11)-(13). The location of a pin i is specified by variables p_x(i) and p_y(i), which represent the x and y coordinates of the pin. The bounds for p_x(i) and p_y(i) are specified in Constraint (17). F_x and F_y are the upper bounds for the pin coordinates along the x and y axes, i.e., the dimensions of the design's floorplan.
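The linear wire delay model can be illustrated with a short sketch. It uses the 30ps per 100µm figure quoted above; the function names and the example coordinates are hypothetical, and cell delays (which the text treats as constants) are left out.

```python
def manhattan_um(p, q):
    """Manhattan distance between two (x, y) points given in micrometers."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def wire_delay_ps(p, q, beta_ps=30.0, per_um=100.0):
    """Buffered-wire delay estimate: beta_ps picoseconds per per_um micrometers."""
    return manhattan_um(p, q) * beta_ps / per_um


# Hypothetical pin locations, 400 um apart in Manhattan distance.
driver_pin = (0.0, 0.0)
sink_pin = (300.0, 100.0)
print(wire_delay_ps(driver_pin, sink_pin))
```

Because the estimate is proportional to distance, it can appear in the LP as a linear expression of the pin coordinates once the absolute values in the Manhattan distance are linearized.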
To avoid unnecessary cell displacements, we add a displacement cost M(i, i_0) in the objective function [15]. The displacement cost is defined as the sum of Manhattan distances between the original cell locations ([p_x(i_0), p_y(i_0)]) and their corresponding cell locations ([p_x(i), p_y(i)]) after optimization. M(i, i_0) is calculated using Constraints (14)-(16). Since the displacement cost forces the LP to "pull" the cells toward their original locations, we use a very small weighting factor (w_dis = 0.001) for the cell displacement cost. We apply uniform weights for TNS and non-common path delays, i.e., w_tns = 1, w_ncp = 1. Since the total wirelength of a top-level clock tree is typically much larger than the timing slacks, we set w_wl = 0.001 so that the cost in the LP is not dominated by the wirelength. Figures 4(a) and 4(b) respectively show the setup and hold WNS (both normalized to their corresponding clock periods) obtained by solving the LP for different values of γ_k. As we sweep γ_k from 1 to 10, the setup WNS obtained from the LP improves but the hold WNS worsens. When we sweep the w_wns/w_tns ratio, the setup and hold WNS are not affected when γ_k ≤ 3. However, when γ_k > 3, the cost in the LP is dominated by the setup WNS, and increasing the w_wns/w_tns ratio improves the setup WNS. Since hold time violations are relatively easy to fix by inserting buffers, we prioritize setup slacks when we select the γ_k and w_wns/w_tns weight ratios. In our experiments, we use γ_k = 5 and w_wns/w_tns = 2000 because we observe experimentally that increasing γ_k further does not improve the setup WNS but makes the hold WNS worse (black arrow in Figure 4(b)). We use the same values of the weighting factors across all testcases. It is also possible to apply different combinations of weighting factor values, run the flows in parallel, and choose the best CTS solution.
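For readers unfamiliar with how a Manhattan distance such as the displacement cost enters a linear program, the standard linearization of absolute values is sketched below. The auxiliary variables m_x(i) and m_y(i) are introduced here purely for illustration, and this is not necessarily the exact form of Constraints (14)-(16).

```latex
\begin{align}
M(i, i_0) &= m_x(i) + m_y(i), \\
m_x(i) &\ge p_x(i) - p_x(i_0), \qquad m_x(i) \ge p_x(i_0) - p_x(i), \\
m_y(i) &\ge p_y(i) - p_y(i_0), \qquad m_y(i) \ge p_y(i_0) - p_y(i).
\end{align}
```

Because M(i, i_0) appears in the objective with a positive weight, the minimization drives m_x(i) and m_y(i) down to |p_x(i) − p_x(i_0)| and |p_y(i) − p_y(i_0)|, so the sum equals the Manhattan displacement at the optimum. The same device applies to the pin-to-pin distances U(i, j), provided they are also penalized in the objective.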
Implementation Heuristics
Given a design with an initial clock tree, G(V, E), and a subset of vertices V_CLC ⊆ V corresponding to CLCs, we extract the top-level clock net using Algorithm 1. First, we create a list V_top of all transitive fanin cells of the CLCs. In Lines 2-4, we remove all the clock routes connected to the fanin cells. In Lines 5-12, we check each cell in V_top, remove all the buffers and reconnect the nets accordingly.
Algorithm 1 Extract top-level clock tree
1: V_top ← transitive fanin cells of V_CLC;
2: for all e(u, v) ∈ E; u, v ∈ V_top do
3:     Remove clock routing for e(u, v);
4: end for
5: E_top ← ∅
6: for v ∈ V_top do
7:     if v is a buffer then
8:         (v.parent).children ← v.children;
9:         V_top ← V_top \ {v};
10:        [...]
11:    end if
12: end for
13: Return G(V_top, E_top);
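The extraction step can be sketched in Python as follows. This is a simplified illustration, not the authors' code: the clock tree is assumed to be given as a `parent` map (each cell to its driver) and a `kind` map (each cell to a type string), both hypothetical data structures, and only connections among top-level cells are reported.

```python
def extract_top_level(parent, kind, clcs):
    """Collect CLCs and their transitive fanin, then bypass buffers.

    parent: dict cell -> driving cell (None for the clock source).
    kind:   dict cell -> type string such as "buffer", "clc" or "source".
    clcs:   iterable of clock logic cells.
    Returns (cells, edges) of the buffer-free top-level tree.
    """
    # Walk from every CLC toward the clock source to gather the fanin cone.
    v_top = set()
    for clc in clcs:
        node = clc
        while node is not None and node not in v_top:
            v_top.add(node)
            node = parent[node]

    # Drop buffers and reconnect each remaining cell to its nearest
    # non-buffer driver.
    kept = {v for v in v_top if kind[v] != "buffer"}
    edges = set()
    for v in kept:
        driver = parent[v]
        while driver is not None and kind[driver] == "buffer":
            driver = parent[driver]
        if driver is not None:
            edges.add((driver, v))
    return kept, edges


# Hypothetical chain: src -> b1 -> mux -> b2 -> cgc -> b3 -> ff
parent = {"src": None, "b1": "src", "mux": "b1", "b2": "mux",
          "cgc": "b2", "b3": "cgc", "ff": "b3"}
kind = {"src": "source", "b1": "buffer", "mux": "clc", "b2": "buffer",
        "cgc": "clc", "b3": "buffer", "ff": "ff"}
top_cells, top_edges = extract_top_level(parent, kind, ["mux", "cgc"])
print(sorted(top_cells), sorted(top_edges))
```

Buffer b3 and the flip-flop lie below the last CLC, so they belong to the bottom-level tree and are left untouched, consistent with the flow's goal of preserving the bottom-level clock trees.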
In the top-level clock balancing problem, the LP optimizes the delays from an output pin to the input pins in every net. For nets with more than one fanout, we convert the net into a binary tree by inserting Steiner points. The purpose of this step is to include the locations of the Steiner points as variables in the LP so as to optimize the non-common paths. Given a net, G_net(V, E), and its driving pin, v_r, we apply Algorithm 2 to obtain a binary tree. In Lines 8-16, we find the pin pair that minimizes the metric ∆L, which is defined as the sum of the difference in sink latency and the delay due to the Manhattan distance between these pins (see footnote 8). In Lines 17-25, we merge the pin pair that has the minimum ∆L by creating a new Steiner point. We define the x and y coordinates of the new Steiner point as the averages of the x and y coordinates of the merged pins (Lines 21-22). The sink latency of the Steiner point is defined as the maximum sink latency of the merged pins (Line 20). The procedure split_net() is invoked repeatedly until all driving pins have a single connection (to a Steiner point).
Figure 5 illustrates our Steiner point insertion algorithm. In the first iteration, we merge pins j_2 and j_3 because they have the smallest ∆L and Manhattan distance. Pins j_2 and j_3 are then connected to Steiner point j′_2 (red square), whose location is the average of the x and y coordinates of pins j_2 and j_3. In the second iteration, we merge pin j_1 with j′_2 because they have a smaller ∆L, even though the Manhattan distance between j_1 and j′_2 is larger than the Manhattan distance between j_4 and j′_2. In the last iteration, we merge j_4 and j′_1. Note that our algorithm selects the pins to merge based on the sum of the Manhattan distance and the difference in sink latency. This is different from the algorithm in [7], which selects the pins based on Manhattan distance only. For example, the algorithm in [7] will merge j_2 and j_3, followed by j_4 and j_1. As shown in Figure 5 (the upper-right clock tree), the algorithm in [7] leads to a clock tree that requires more buffers to be inserted (red arrows) to balance the clock latencies (green arrows) compared to the tree produced by our algorithm (the lower-right clock tree).
Algorithm 2 Create Steiner points
1-6: [...]
7: min_∆L ← ∞;
8: for (u_1, u_2 ∈ v_r.child) do
9-10: [...]
11: if (∆L(u_1, u_2) ≤ min_∆L) then
12:     u_min1 ← u_1;
13-18: [...]
19: u′.child ← {u_min1, u_min2};
20: [...]
21: p_x(u′) ← (p_x(u_min1) + p_x(u_min2))/2;
22: p_y(u′) ← (p_y(u_min1) + p_y(u_min2))/2;
23: [...]
24: V′ ← V′ ∪ {u′};
25: E′ ← E′ ∪ {e(u′, u_min1), e(u′, u_min2)};
26: end while
27: E′ ← E′ ∪ {e(v_r, u′)};
28: end if
29: Return G_net(V′, E′);
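The merging procedure described for Algorithm 2 can be sketched as a greedy loop. This is a simplified illustration under stated assumptions: the sink latency of each pin is taken as a given number, ∆L is computed as the absolute latency difference plus β times the Manhattan distance, and the Steiner point names (s1, s2, ...), coordinates, latencies and β value are all hypothetical.

```python
from itertools import combinations


def delta_l(a, b, beta):
    """Latency difference plus distance-induced delay between two nodes."""
    (x1, y1, l1), (x2, y2, l2) = a, b
    return abs(l1 - l2) + beta * (abs(x1 - x2) + abs(y1 - y2))


def build_binary_tree(sinks, beta):
    """Greedily merge the pair with minimum delta_l until one node remains.

    sinks: dict name -> (x, y, sink_latency).
    Returns a list of (steiner_name, child_1, child_2) in merge order.
    """
    nodes = dict(sinks)
    merges = []
    while len(nodes) > 1:
        u1, u2 = min(
            combinations(sorted(nodes), 2),
            key=lambda pair: delta_l(nodes[pair[0]], nodes[pair[1]], beta),
        )
        x1, y1, l1 = nodes.pop(u1)
        x2, y2, l2 = nodes.pop(u2)
        name = f"s{len(merges) + 1}"
        # Steiner point: averaged coordinates, maximum sink latency.
        nodes[name] = ((x1 + x2) / 2, (y1 + y2) / 2, max(l1, l2))
        merges.append((name, u1, u2))
    return merges


# Hypothetical pins (x in um, y in um, latency in ps), loosely following
# the situation described for Figure 5.
pins = {
    "j1": (0, 100, 190),
    "j2": (100, 100, 200),
    "j3": (120, 100, 200),
    "j4": (150, 60, 100),
}
tree = build_binary_tree(pins, beta=0.5)
print(tree)
```

In this example, j1 is merged before j4 even though j4 is physically closer to the first Steiner point, because j1's latency is much closer to that of the merged pair.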
By solving the LP, we obtain cell locations, clock routes (Steiner point locations) and net delays in the top-level clock tree. Next, we insert buffers in the top-level clock tree to guide clock routing and control clock skews. For each two-pin net in the optimized top-level clock tree, we insert buffers according to the steps described in Algorithm 3. In Line 1, we initialize the variable n, which indicates the number of inserted buffers, to 1. In Lines 2-14, we calculate the number of buffers required to meet the delay target as a function of net delays and buffer delays. M_buf is the minimum required spacing between two buffers (we use M_buf = 5µm in our experiments). The while loop exits when the sum of net and buffer delays (d_est) exceeds the required delay between the pins i and j (d_req). In Lines 15-21, we calculate the minimum wirelength required to insert n buffers. If this wirelength is less than or equal to the Manhattan distance between pins i and j, M(i, j), we place the buffers in an L-shaped (y-axis first, followed by x-axis) manner. Otherwise, we place the buffers in a U-shaped manner because the total wirelength exceeds M(i, j). U-shaped placement is the general case; L-shaped placement is the special case in which the total wirelength is ≤ M(i, j).
Footnote 8: We convert the Manhattan distance to delay by a conversion factor β_k at the reference condition.
Algorithm 3 Insert buffers
Procedure insert_buffers()
Input: pins i and j, d_req(i, j)
Output: inserted buffers
1: n ← 1;
// calculate number of buffers to meet required delay
2: [...]
   n ← n − 1;
   [...]
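Since most of Algorithm 3 is not reproduced here, the following is only a rough sketch of the decision logic described in the text. It assumes that each buffer stage (buffer plus its wire segment) contributes a fixed delay, and that the minimum wirelength for n buffers is n times the minimum spacing; both are simplifications introduced for this example, and `plan_buffers` and its parameters are hypothetical names.

```python
def plan_buffers(d_req_ps, manhattan_um, stage_delay_ps, m_buf_um=5.0):
    """Estimate the buffer count for a two-pin net and choose a route shape.

    d_req_ps:       required delay between the two pins.
    manhattan_um:   Manhattan distance between the two pins.
    stage_delay_ps: assumed fixed delay of one buffer stage.
    m_buf_um:       minimum spacing between two buffers.
    """
    n = 1
    # Grow n until the estimated delay exceeds the requirement ...
    while n * stage_delay_ps <= d_req_ps:
        n += 1
    # ... then step back to the last count that did not exceed it.
    n -= 1

    min_wirelength_um = n * m_buf_um
    # Buffers fit along the direct route: L-shape; otherwise detour: U-shape.
    shape = "L" if min_wirelength_um <= manhattan_um else "U"
    return n, shape


short_delay_plan = plan_buffers(d_req_ps=120.0, manhattan_um=400.0,
                                stage_delay_ps=30.0)
long_delay_plan = plan_buffers(d_req_ps=300.0, manhattan_um=40.0,
                               stage_delay_ps=30.0)
print(short_delay_plan, long_delay_plan)
```

The second call shows the U-shaped case: the required delay needs more buffers than can be spaced along the direct Manhattan route, so the route must detour.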
EXPERIMENTS
To test the effectiveness of our methodology, we require testcases with complex top-level clock trees. Since existing benchmarks [10] [23] typically lack complex top-level clock trees, we generate testcases based on common clock tree structures typically found in high-speed SOCs and IPs [21] [22] . The clock structures of our testcases are shown in Figures 6(a)-(f) . We use dual-Vt 28nm foundry FDSOI libraries and implement each testcase at two operating modes -{1.25GHz at 0.95V} and {1.667GHz at 1.20V}. We perform placement and routing (P&R) using a commercial tool and use Synopsys PrimeTime vH-2013.06-SP2 [25] for timing analysis. Table 1 shows the timing analysis parameters in our experiments.
Testcase Description and Generation
Testcases from Tsay [19], Kahng and Tsao [10] and the ISPD-2009/2010 [23] CNS contest benchmarks lack CLCs and are insufficient to create complex top-level clock hierarchies. Kahng et al. [11] improve CTS testcases by adding CLCs (Figures 3(a) and 3(b) in [11]) but ignore two key elements: (1) logic between FF groups, and hence critical paths between FF groups; and (2) multiple clock sources. The CTS problem becomes difficult when synchronous and asynchronous clocks need to be balanced across multiple FF groups. We improve over [11] by (1) adding combinational logic with varying numbers of stages between FF groups, (2) adding multiple synchronous and asynchronous clocks, (3) using CLCs at different hierarchies to make the clock balancing problem very complex, (4) creating multiple top-level clock hierarchies, and (5) performing CTS with MCMM and OCV constraints. Figures 6(a)-(f) show the six testcases T1-T6 used in our experiments. These testcases use three clock sources typically seen in SOC designs [21] and can have large fanouts (e.g., >1000 FFs). The clock source m_clk is from the crystal oscillator, clk is the output of a PLL and scan_clk is the test clock. Clock sources m_clk and clk are used to implement low-power modes of operation, such as DVFS. The testcases use three kinds of dividers (DIV2, DIV4, DIV8 in the figures), a glitch-free clock MUX, and integrated clock gating cells (CGCs) as CLCs. Outputs of all dividers are sources of generated clocks; the generated clocks typically drive FFs for debug/tracing, IO and other peripheral logic.
To implement variable stages of combinational logic, we use NetGen [26] and vary #stages from 15 to 30. To model different critical paths, we connect FFs across groups as well as within the same group using these logic stages. To obtain floorplan dimensions that resemble SOCs, we use multiple instantiations of an interface logic module (ILM) of the jpeg_encoder design from OpenCores [24] . We create a netlist with the top module x5_jpeg, in which we instantiate the jpeg_encoder design five times, perform SP&R and generate an ILM. Note that in this paper, we do not optimize the bottom-level clock tree. Therefore, instantiation of the same x5_jpeg multiple times (instead of using different modules) does not change the outcome of our experiments. We connect multiple instances of the ILM using combinational logic stages. For all CLCs, we implement custom netlists in the 28nm foundry FDSOI technology, and group FFs within the CLCs into their own skew groups so that these FFs do not affect global skew and latencies. The path latencies of FF groups are controlled by changing timing constraints and the number of stages of combinational logic between the groups. To allow a blockage-free placement region for the CLCs, we place ILM blocks (hard macros for the CTS tools) in an L-shaped manner along the periphery of the core as shown in Figure 7 (a).
All testcases contain bidirectional paths, i.e., both launch and capture FFs appear in FF groups that are driven by the fastest clock and other slower clocks. In addition, the fastest clock drives around 90% of the FFs that do not belong to the ILMs. Table 2 shows #CLCs, #cells, the FFs not in the ILM, FFs in the ILM, FFs at the ILM boundary, and the area of each testcase (design in table). Testcases T2, T3 and T6 contain critical paths between FFs from two different clocks, one with large latency and the other with small latency. The CTS problem is complicated by the need to balance skew between these FF groups. Testcases T1-T4 contain multiple generated clocks and reconvergent paths between these clocks. These testcases make the CTS problem complex because skew needs to be balanced between fast and slow clocks. In testcases T3-T5, the control signals of CGCs are generated by clk, which makes the latency of the signal to the enable pin of the CGCs very critical. Besides balancing skews, CTS also needs to balance the critical path delays of the enable signal to the CGCs along with the clock latency. To report timing paths across clocks accurately, we set the path multiplier in the Synopsys Design Constraint (SDC) [3] file for paths between all clocks.
Experimental Results
Table 3 summarizes the key metrics of the clock tree before (I = Initial, produced by a commercial tool) and after (O = Optimized) applying our top-level clock tree optimization. Rows 1-14 in Table 3 show the results at the post-CTS stage, while later rows show the results at the end of the implementation flow (after datapath routing).
Post-CTS stage:
Our optimization flow reduces the total wirelength of the top-level clock tree by 53% to 68% across all six testcases. Figure 7 shows that the wirelength reduces because our flow clusters the CLCs such that the clock tree does not split near the clock entry points. The large wirelength reduction suggests that the initial CLC placements by EDA tools may not be aware of the CTS requirements. The smaller wirelength also allows the optimized clock tree to use fewer buffers in the remaining testcases. In testcases T4 and T5, however, the number of buffers is larger, as our optimization flow inserts more buffers in the clock tree to improve timing slack. To estimate switching power, we extract gate and wire capacitances of the top-level clock tree. Rows 5-6 in Table 3 show that our flow can reduce the switching power in the top-level clock tree by 12% to 40% for all testcases, including testcases T4 and T5, where the number of buffers increases.
Our flow also improves the setup WNS and TNS by up to 550ps and 255ns, respectively (Rows 7-10). Hold WNS and TNS are also improved except for testcase T6, in which the hold WNS and TNS worsen by 110ps and 780ps, respectively (Rows 11-14). Our optimization flow can worsen hold WNS and TNS because we focus on improving the setup slacks (γ_k = 5). The tradeoff between setup and hold slacks is based on the following assumptions: (1) hold time violations are easier to fix in post-CTS implementation stages, and (2) some of the hold time violations are fixed by the increased wire delays in the routing stage. In Rows 30-34 of Table 3, we report runtimes of the main procedures in our optimization flow. We spend most of the time extracting timing information and formulating the LP (see footnote 11). CLC placement, buffer insertion, legalization and routing only take 10 minutes in total because there are not many cells in the top-level clock tree. The total runtime is 135 minutes on average. Testcase T3 has a higher runtime because it has more timing-critical paths than other testcases (Row 29).
[Figure 7 caption, partially recovered: ... of the top-level clock trees is shown in black. Our flow splits common paths farther from the clock root compared to the initial clock tree. As a result, the total wirelength in the top-level clock tree is reduced from 45mm to 22mm.]
Post-datapath routing stage:
To study the benefits of our optimization flow, we also compare the post-routing results between the initial and the optimized clock trees. The results in Table 3 show that all designs with the optimized clock tree have the same or improved setup WNS compared to the designs with the initial clock tree (Rows 21-24). The improvement in setup WNS at the post-routing stage is up to 320ps. Although some testcases with the optimized clock tree have worse hold slacks (i.e., testcases T4, T5 and T6), the differences are less than 100ps. The results in Rows 15-16 show that our optimization flow reduces the total wirelength by 38% to 51% across all six testcases. The improvements are smaller compared to the post-CTS stage because the total wirelength of the initial and optimized clock trees both increase at the post-routing stage due to wiring of the signal nets. The total number of buffers and the switching power at the post-routing stage are similar to the values seen at the post-CTS stage.
Footnote 11: Solving the LP takes less than 30 seconds.
CONCLUSIONS
Designing a balanced top-level clock tree with multiple clock sources is very complex as we need to consider MCMM, OCV and timing constraints across FF groups. We develop a CTS methodology that optimizes CLC placement and buffer insertion, and that minimizes non-common paths between FF groups. We formulate the top-level CTS problem as the minimization of a weighted sum of WNS, TNS, clock uncertainty due to OCV and wirelength. We solve this problem using LP and develop heuristic flows to insert Steiner points and buffers, which are required elements of a top-level CTS solution. We also develop generators for testcases that resemble clock tree structures typically found in high-speed SOCs. We validate our optimization flow on testcases from our generators and achieve up to 51% reduction in wirelength for the top-level clock tree, and 320ps improvement in WNS, compared to a leading commercial CTS tool. Our future work includes (i) handling obstacles, (ii) accounting for optimal buffering solutions, (iii) creating testcases to capture other important SOC elements such as memory controllers and multimedia blocks, and (iv) joint optimization of the top- and bottom-level clock trees.
