Temperature variation in microprocessors is a workload-dependent problem. In such designs, clock skew should be minimized with respect to temperature variation. Existing work has studied clock tree embedding perturbation under time-variant temperature variation, but no existing method reduces skew variation. This paper develops an efficient yet effective simultaneous hotspot-avoiding embedding and thermal-aware routing (TMST) method: hotspot-avoiding embedding keeps the tree topology out of areas that are likely to reach high temperature, and thermal-aware routing reduces skew by routing tree paths through areas with smoother temperature. With a thermally tolerant tree structure, our method reduces not only delay skew but also skew variation (the skew violation range). Compared with the Greedy-DME (GDME) method of Edahiro and the existing thermal-aware clock synthesis methods TACO and PECO, our TMST solution reduces skew variation by 2X. As the number of temperature maps is scaled from 100 down to 1, TMST also guarantees the smallest wire-length overflow. TMST reduces the worst-case skew by up to 4X relative to PECO and 5X relative to TACO.
I. INTRODUCTION
Clock synthesis has been extensively studied. Given a set of sinks (flip-flops), clock synthesis finds an abstract topology and an embedding that minimize the mismatch of arrival times (i.e., skew) between sinks. Wire-length minimization under zero or bounded skew constraints was presented in [1], [2], [4]. However, skews induced by process, supply voltage, and temperature variation were not considered. The exponential advance of very large scale integration (VLSI) technology has resulted in high and non-uniform power dissipation over chips [4], which leads to a temperature gradient, i.e., a spatial temperature variation over the chip that can cause non-negligible delay differences in both interconnects and devices. As clock signals are routed globally over the chip, the temperature gradient can bring significant skew variations [5]. Therefore, the traditional constant zero/bounded skew routing methods become invalid.
The history of clock tree synthesis under thermal variation is short. TACO [5] constructs a tree that balances the skew under two given static thermal profiles (uniform and worst-case). The argument is that a skew optimized under one thermal profile will vary significantly under the other profile, hence the need to balance. TACO first builds a non-buffered, zero-skew clock tree under a uniform thermal profile and then refines it under the worst-case (non-uniform) profile.

Table 1: Notations and definitions.
  Symbol          Definition
  D(s_0, s_k)     Delay time from the source to sink k.
  S_q             Steiner point q.
  s_0             Source node.
  T(t, n)         Temperature over the chip at time t for each node n.
  T̄               Mean temperature for nodes n.
  cc_p            Correlation coefficient in grid p.
  M_j             Merging point j.
  d(M_j, s_k)     Routing path from merging point j to sink k.
  r_unit          Unit-length resistance.

Since the thermal gradient changes over time while the chip is running, transient analysis/optimization is an extremely difficult task, if it is possible at all. Considering the skew caused by the spatial variation of temperature, PECO [5] is the only existing clock synthesis method that considers time-variant variation, which it handles by perturbing the embedding points.
With additional merging points, PECO incurs more wire-length overhead. PECO does not discuss the skew variation issue either.
The skew variation problem cannot be ignored in current VLSI technology. A smaller skew variation region leads to a more stable system and a more accurate timing analysis. This paper focuses on minimizing temperature-variation-induced skew and skew variation. We propose a simultaneous hotspot-avoiding embedding and thermal-aware routing (TMST) method to minimize skew variation and the worst-case skew, considering the on-chip temperature variation and its spatial correlation with respect to operating time. The main contributions of this paper are as follows. First, without any extra components such as crosslinks or inserted buffers, our proposed synthesis guarantees a smaller wire length, which leads to lower power consumption. Second, we build the clock tree embedding according to temporal and spatial temperature information, which gives our tree a smaller dynamic skew range (skew variation); this paper is the first to discuss this issue. Third, our method considers time-variant temperature variation using different kinds of delay models. Across all tests, TMST reaches not only the smallest skew but also the smallest skew variation.
The rest of this paper is organized as follows. Section 2 gives the notation and problem statement. Section 3 presents the temperature model and the delay model. Section 4 summarizes the overall algorithms. We show our experimental results in Section 5 and conclude the paper in Section 6. Table 1 shows the notations and definitions used in the following sections.
Clock skew minimization is an important design task for ensuring correct behavior of sequential circuits, and much work has been done on designing zero-skew clock trees. The main source of design uncertainty is error in capacitance estimation. Even if we reduce design uncertainty by using detailed analysis and optimization, clock skew still occurs due to manufacturing and environmental variability. This paper mainly focuses on the temperature issue. Different methods, such as buffer and crosslink insertion, have been proposed for reducing skew. However, because they add wire length or analog elements, both crosslink and buffer insertion significantly increase power consumption. Without any extra elements, our method reduces the skew by changing the routing path itself to compensate for skew and skew variation. We build the entire tree topology and embedding in areas with high spatial and temporal correlation, considering time-variant temperature gradients. We also avoid routing through, and placing merging points on, hotspot areas. A highly correlated area guarantees the same trend of temperature change, which leads to smoother temperature grids.
II. NOTATION AND PROBLEM STATEMENT
Given a clock distribution topology generated by GDME, we re-embed a new tree with more highly correlated routing paths and merging points, under the constraints below.
• 1: We apply our method only to highly correlated sink pairs and keep the same path as GDME otherwise.
• 2: We build our clock tree in a bottom-up fashion.
• 3: We avoid placing merging points and routing paths on hotspot areas.
• 4: All routing paths and merging points are built in highly correlated areas.
III. MODELING

A. Stochastic Temperature Model and Correlation Map
The same temperature modeling as in [9] is assumed in this work. To make the paper self-contained, we first review the temperature, gradient, and correlation models. The temperature gradient and correlation are obtained by a microarchitecture-level power and temperature simulation [11], where the Alpha architecture is used as an example.
The overall chip is divided into a uniform grid with N nodes in total. By running six different applications (ammp, art, compress, equake, gzip, gcc) from the SPEC2000 benchmark in sequence (each with a time period $t_p$), the thermal power is obtained by averaging the cycle-accurate (scale of ps) dynamic power in the thermal-constant scale (scale of ms). Using this time-variant thermal power as input, the transient temperature $T(t_i, n_j)$ over the chip is calculated at each time instant $t_i$ for each node $n_j$ in the grid. To automatically extract the correlation of temperature variations, the temperatures at the N nodes are modeled as random processes. Each node is described by a temperature sequence sampled at N time instants, and the correlation coefficient between nodes $n_i$ and $n_j$ is

$$C(i, j) = \frac{\mathrm{cov}(i, j)}{\sigma_{n_i}\,\sigma_{n_j}}, \qquad \mathrm{cov}(i, j) = E\left[\left(T(t, n_i) - \bar{T}_{n_i}\right)\left(T(t, n_j) - \bar{T}_{n_j}\right)\right],$$

where $\mathrm{cov}(i, j)$ is the covariance matrix between nodes, $\sigma_{n_i}$ and $\sigma_{n_j}$ are the standard deviations for nodes $n_i$ and $n_j$, and $\bar{T}_{n_i}$ and $\bar{T}_{n_j}$ are the mean temperatures for nodes $n_i$ and $n_j$, respectively. The correlation coefficients $C(i, j)$ can be precomputed and stored in a table. Figure 1 shows the distribution of one calculated correlation matrix. The average correlation is about 0.8 for the SPEC2000 benchmarks [9].
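The precomputation of the correlation table can be illustrated with a minimal Python sketch. It assumes that each node's temperature sequence is available as a plain list and that population (not sample) statistics are used; the function name `correlation_matrix`, the toy temperatures, and the convention of returning 0 for a node with constant temperature are illustrative assumptions, not part of the original flow.

```python
from math import sqrt

def correlation_matrix(temps):
    """temps[i] is the temperature sequence of node n_i over the sampled time instants."""
    n = len(temps)
    samples = len(temps[0])
    means = [sum(seq) / samples for seq in temps]
    # Centered sequences T(t, n_i) - mean temperature of n_i
    centered = [[v - m for v in seq] for seq, m in zip(temps, means)]
    sigmas = [sqrt(sum(d * d for d in c) / samples) for c in centered]
    corr = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            cov = sum(a * b for a, b in zip(centered[i], centered[j])) / samples
            if sigmas[i] > 0 and sigmas[j] > 0:
                corr[i][j] = cov / (sigmas[i] * sigmas[j])
            # Assumption: a node whose temperature never changes keeps correlation 0.
    return corr

# Hypothetical temperatures (degC) of three nodes at four time instants.
temps = [
    [60.0, 70.0, 80.0, 90.0],  # node 0 heats up
    [50.0, 55.0, 60.0, 65.0],  # node 1 follows the same trend
    [90.0, 80.0, 70.0, 60.0],  # node 2 cools down
]
C = correlation_matrix(temps)
```

In this toy input, nodes 0 and 1 follow the same trend and are fully correlated, while node 2 moves in the opposite direction and is fully anti-correlated with node 0.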
B. Skew Model with Temperature Variation
As suggested by [5], when a clock wire experiences a temperature gradient, the unit-length resistance $r_{unit}$ is as follows,

$$r_{unit}(x, y) = \rho_0 \left(1 + \beta\, T(x, y)\right),$$

where $\rho_0$ is the unit-length resistance at 0 °C, and $\beta$ is the temperature coefficient of resistance (1/°C).
where $E[r_{unit}(x, y)]$ is the mean value of the resistance in edge $e(M_i, s_k)$. Following the conventional definition of propagation delay, the delay from the source node $s_0$ to sink $s_i$, $D(s_0, s_i)$, is the time required for the node voltage (waveform) to pass 100% of the peak voltage under impulse excitation at the source node. After obtaining the source-to-sink delay of the j-th routing configuration $Conf_i^j$ in level i, we can calculate the worst-case skew corresponding to $Conf_i^j$.
The worst-case skew is then determined by the preserved routing paths from all levels.
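To make the effect of temperature on skew concrete, the following Python sketch combines the linear resistance model with a simple delay estimate. Several things here are assumptions made for illustration only: the delay is the Elmore delay of an unbranched wire (not the peak-based definition above), the worst-case skew is taken as the largest max-minus-min sink delay over all temperature maps, and the function names and the two toy temperature maps are hypothetical. The parameter values are the interconnect parameters quoted in the experimental section.

```python
# Interconnect parameters quoted in the experimental setup of this paper.
RHO_0 = 0.1       # unit-length resistance at 0 degC, ohm/um
C_UNIT = 2.0e-16  # unit-length capacitance, F/um
BETA = 0.0068     # temperature coefficient of resistance, 1/degC

def r_unit(temp_c):
    """Unit-length resistance at temperature temp_c under the linear model."""
    return RHO_0 * (1.0 + BETA * temp_c)

def path_delay(segments, load_cap=0.0):
    """Elmore delay of an unbranched wire (assumed model).

    segments: list of (length_um, temp_c) ordered from source to sink.
    """
    downstream = load_cap  # capacitance seen beyond the current segment
    delay = 0.0
    for length, temp in reversed(segments):  # walk from sink back to source
        r = r_unit(temp) * length
        c = C_UNIT * length
        delay += r * (c / 2.0 + downstream)
        downstream += c
    return delay

def worst_case_skew(delays_per_map):
    """delays_per_map[m][k] is the delay to sink k under temperature map m."""
    return max(max(d) - min(d) for d in delays_per_map)

# Two sinks on equal-length (1000 um) paths under two hypothetical maps.
delays_per_map = [
    [path_delay([(1000.0, 50.0)]), path_delay([(1000.0, 50.0)])],   # uniform map
    [path_delay([(1000.0, 50.0)]), path_delay([(1000.0, 100.0)])],  # gradient map
]
skew = worst_case_skew(delays_per_map)
```

In this example the tree is perfectly balanced under the uniform map, yet a 50 °C difference between the two paths produces a skew of about 3.4 ps, which is the kind of variation a geometry-only zero-skew construction cannot see.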
C. Problem Formulation
The simultaneous hotspot-avoiding embedding and thermal-aware routing (TMST) problem is formulated as follows.
Formulation 1 (Simultaneous hotspot-avoiding embedding and thermal-aware routing (TMST)): Given a source $s_0$, sinks $s_1, \cdots, s_n$, an initial clock tree embedding, and a set of temperature variation maps, find a proper re-embedding (including merging points and re-routing) for the new tree that minimizes the worst-case skew under the given temperature maps.
IV. ALGORITHMS
A. Overall Algorithm
Given a GDME-initialized clock tree, re-embedding is performed by thermal-aware Maze routing. The worst-case skew and the re-embedding are determined in a bottom-up fashion. At each level, the merging points and routing paths are picked according to their correlation strength. The resulting routing paths are then routed through strongly correlated areas, and only those sink paths that could cause large skew changes (highly correlated) are selected for re-embedding.
In summary, the overall algorithm is shown in Figure 2, and its pseudocode is presented in Algorithm 1.
[Figure 2: Overall algorithm. Put the merging point in the largest-correlation-weight area of the merging line; Maze routing considers the weight of distance and the correlation cofactor.]
B. Correlation Cofactor
A smoother routing path guarantees a more stable skew, since dynamic temperature variation over time generates different skews. Thermal-aware routing topology optimization (TMST) is an effective algorithm that considers time-variant temperature variations with spatial and temporal correlation. TMST generates a temperature correlation map by analyzing time-variant temperature maps, and avoids the hotspots indicated in the temperature correlation map of the clock tree structure. Without using merging point perturbation, we use thermal-aware routing to balance the skew, which can still keep the same or a similar Manhattan distance. First, we build a macro model for temperature variations to get temperature maps for various timestamps. To model such on-chip time-variant temperature, we impose a grid onto the chip, and each grid cell is assigned a temperature range. This temperature range can be obtained by measurement or thermal simulation. A complete instruction set is tested and the corresponding K temperature profiles are obtained. The overall temperature variation can then be obtained based on these profiles.

Algorithm 1:
  Generate initial embedding tree by GDME
  while p_i^j != PC_i do
    Find the merging segment
    Choose the largest-correlation-weight area that is not a hotspot as the merging point
    Apply thermal-aware Maze routing to find the highest correlated path
  end while

Because of the linear relationship between wire resistance and temperature, high temperature variance can increase the delay by as much as 100%. Therefore, it is necessary to find a routing path with a smooth temperature gradient in order to avoid the clock skew created by temperature variations.
To find a suitable cofactor for thermal-aware routing, we use a "correlation cofactor" for each pair of target points we want to analyze. Unlike a correlation coefficient, the correlation cofactor of area p with respect to targets i and j is defined as the sum of the two corresponding correlation coefficients, $cc_p = \mathrm{cov}(i, p) + \mathrm{cov}(p, j)$. An area with a high correlation cofactor is highly correlated with both routing target points. TMST tries to build the entire tree in areas with high correlation cofactor values, which gives the clock tree better thermal tolerance. Avoiding hotspot areas further reduces the worst-case skew.
Definition 1: The correlation cofactor of area p with respect to routing points S and T is defined as the sum of the two corresponding correlation coefficients, $cc_p = \mathrm{cov}(S, p) + \mathrm{cov}(p, T)$.
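Definition 1 translates directly into a table lookup. The sketch below assumes the correlation coefficients are stored in a symmetric matrix indexed by node; the helper names and the small 3-node table are hypothetical and chosen only to show the arithmetic.

```python
def correlation_cofactor(corr, s, t, p):
    """Correlation cofactor of area p for routing points s and t (Definition 1)."""
    return corr[s][p] + corr[p][t]

def cofactor_map(corr, s, t, cells):
    """Cofactor of every candidate cell, e.g. the cells inside one routing window."""
    return {p: correlation_cofactor(corr, s, t, p) for p in cells}

# Hypothetical correlation table: index 0 and 1 are the two routing points,
# index 2 is a candidate grid cell.
corr = [
    [1.00, 0.80, 0.25],
    [0.80, 1.00, 0.25],
    [0.25, 0.25, 1.00],
]
cc = correlation_cofactor(corr, 0, 1, 2)
cc_all = cofactor_map(corr, 0, 1, [2])
```

Because each correlation coefficient lies in [-1, 1], the cofactor lies in [-2, 2]; a value near 2 means the cell tracks the temperature trend of both routing points.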
C. Thermal Aware Maze With Dynamic Window
High temperature variability greatly impacts the delay and skew of a zero-skew clock tree. Figure 3 shows that routing paths with different temperature gradients have different skews. High temperature variance increases the delay by 100%, i.e., from 2 ns to 4 ns. In Figure 3, on the one hand, we can see the subregion with high temperature; on the other hand, we can also find a path with a smooth temperature gradient that still keeps the wire length short.
Routing is a crucial step in clock tree synthesis. Most existing thermal-aware clock tree synthesis methods use a rough Manhattan distance to simulate and evaluate the delay between each pair of nodes [6], [9]. In practice, Manhattan distance is not accurate enough when time-variant temperature maps are considered and the objective function weighs both temperature and distance. In this paper we use thermal-aware Maze routing that considers both distance and the temperature correlation cofactor, which not only evaluates the delay between nodes accurately but also reduces the skew and skew variation significantly. However, not every highly correlated path leads to better thermal tolerance: a path through two areas with entirely different temperatures sometimes gives better skew reduction. In order to choose suitable path candidates, we only target paths according to the correlation strength of their sink points in this paper.
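One way to realize Maze routing with a combined distance and correlation weight is a shortest-path search over the grid. The sketch below is an illustration under stated assumptions rather than the implementation used in the experiments: it uses Dijkstra's algorithm, charges each step `w_dist + w_corr * (2 - cc)` where `cc` is the correlation cofactor of the entered cell, and treats hotspot cells as blocked. The function name, the weights, and the 3x3 cofactor grid are hypothetical.

```python
import heapq

def thermal_aware_maze_route(cofactor, hotspots, start, goal, w_dist=1.0, w_corr=1.0):
    """Cheapest 4-connected path from start to goal that never enters a hotspot."""
    rows, cols = len(cofactor), len(cofactor[0])
    cc_max = 2.0  # upper bound of a sum of two correlation coefficients
    best = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        cost, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if cost > best[cell]:
            continue  # stale heap entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols) or nxt in hotspots:
                continue
            # Every step pays for distance plus a penalty for low correlation.
            new_cost = cost + w_dist + w_corr * (cc_max - cofactor[nr][nc])
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = cell
                heapq.heappush(heap, (new_cost, nxt))
    if goal not in best:
        return None  # goal is unreachable without crossing a hotspot
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]

# Hypothetical cofactor grid; the center cell has the highest value but is a hotspot.
cofactor = [
    [1.0, 0.4, 0.6],
    [1.6, 2.0, 0.8],
    [1.8, 1.7, 1.0],
]
path = thermal_aware_maze_route(cofactor, {(1, 1)}, (0, 0), (2, 2))
```

Here the router goes around the hotspot through the lower-left cells, which have the higher cofactors, and the path still has the minimum Manhattan length of four steps. The ratio of `w_dist` to `w_corr` controls how much detour is tolerated for a smoother thermal path.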
D. Overview
As shown in Figure 4(a), we get the initial tree from the GDME topology. Unlike GDME, we build a small dynamic window for each candidate pair before determining the merging point and routing path. Dynamic windows are small grids weighted with spatiotemporal information. We determine the routing paths and merging points level by level based on the dynamic windows, as shown in Figures 4(c) and (d). Figure 5 presents a short overview of how to build the dynamic windows. The green spots are the sinks of the clock tree, and the stars are the hotspot areas.
For grid j, its correlation coefficient with respect to sink k is 0.25, and its correlation coefficient with respect to sink i is also 0.25. The sum of these two coefficients is the correlation cofactor of grid j for these two sinks, which is 0.5 in this case. Only the correlation coefficients of the grids inside the dynamic window of the two sinks are calculated. After obtaining the correlation coefficients, we find the merging segment by calculating the Manhattan distance. We determine the merging point by avoiding the hotspots and searching for the grid with the highest correlation coefficient on the merging segment. In this grid map, the highest correlation coefficient is 1, but it lies in the hotspot area. Instead of choosing this grid, we choose the grid with the next highest correlation coefficient as our merging point in order to avoid the hotspot. As illustrated in Figure 6, after choosing the merging point, we apply thermal-aware Maze routing to select the highest correlated path while maintaining the same Manhattan distance.
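The merging-point rule described above (highest correlation on the merging segment, skipping hotspots) can be sketched in a few lines of Python. The function name, the cell coordinates, and the correlation values are hypothetical; the merging segment is assumed to be given as a list of grid cells.

```python
def choose_merging_point(segment_cells, cofactor, hotspots):
    """Non-hotspot cell on the merging segment with the largest correlation value.

    Returns None if every cell of the segment is a hotspot.
    """
    candidates = [cell for cell in segment_cells if cell not in hotspots]
    if not candidates:
        return None
    return max(candidates, key=lambda cell: cofactor[cell])

# Hypothetical merging segment (a Manhattan arc) with one hotspot cell.
segment = [(2, 0), (1, 1), (0, 2)]
cofactor = {(2, 0): 0.5, (1, 1): 1.0, (0, 2): 0.75}
hotspots = {(1, 1)}
merging_point = choose_merging_point(segment, cofactor, hotspots)
```

The cell with value 1.0 is rejected because it is a hotspot, so the cell with the next highest value, 0.75, becomes the merging point.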
V. EXPERIMENTAL RESULTS
The proposed algorithm has been implemented in C++ and Matlab. We report runtime on a Linux workstation with a 1.9 GHz P4 CPU and 2 GB of memory. We use the standard clock-tree benchmarks r1-r5, which are the same benchmarks used in TACO and PECO. Two more industrial cases, JPEG1 and JPEG3, are also tested in this paper. The initial zero-skew tree is constructed by the GDME [4] method under the Elmore delay model with no temperature variation. The interconnect has unit resistance $r_0 = 0.1$ Ω/μm, unit capacitance $c_0 = 2.0 \times 10^{-16}$ F/μm, and temperature sensitivity $\beta = 0.0068$. These interconnect parameters are the same as those in TACO [6].
A chip with a size of 6 cm² is divided into a uniform grid to obtain the distributed temperature map from a micro-architecture-level cycle-accurate power/temperature simulator [11]. Our experiments use six SPEC2000 applications (art, ammp, compress, equake, gcc, and gzip). We collect 100 temperature maps by simulating these applications in sequence and recording a temperature map every 10 million clock cycles after fast-forwarding 1 billion cycles. These applications lead to a temperature variation of about 50 °C over the 100 temperature maps. We then find hotspot regions with high temperature variation (variance) and extract the correlation matrix, i.e., the covariance for pairwise regions. Below, we report results for the minimum-cost bounded-skew routing tree problem under the pathlength (linear) delay model in Table II. Compared with the Greedy-DME (GDME) method of Edahiro and the existing thermal-aware clock synthesis methods TACO and PECO, our TMST solution reduces skew variation by 2X. As the number of temperature maps is scaled from 100 down to 1, TMST also guarantees the smallest wire-length overflow. TMST reduces the worst-case skew by up to 4X relative to PECO and 5X relative to TACO.
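The hotspot extraction step can be sketched as a per-cell variance computation over the recorded temperature maps. This is a minimal illustration: the use of a fixed variance threshold, the threshold value, the function names, and the 2x2 toy maps are all assumptions and do not come from the experimental setup.

```python
def temperature_variance(maps):
    """Per-cell population variance over a list of 2-D temperature maps."""
    k = len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    var = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            vals = [m[r][c] for m in maps]
            mean = sum(vals) / k
            var[r][c] = sum((v - mean) ** 2 for v in vals) / k
    return var

def find_hotspots(maps, var_threshold):
    """Cells whose temperature variance exceeds var_threshold (assumed criterion)."""
    var = temperature_variance(maps)
    return {(r, c)
            for r, row in enumerate(var)
            for c, v in enumerate(row)
            if v > var_threshold}

# Three hypothetical 2x2 temperature maps (degC).
maps = [
    [[60.0, 60.0], [70.0, 65.0]],
    [[85.0, 62.0], [70.0, 70.0]],
    [[110.0, 64.0], [70.0, 75.0]],
]
hotspots = find_hotspots(maps, var_threshold=100.0)
```

Only the top-left cell, which swings by 50 °C across the three maps, exceeds the threshold; the resulting set can be passed to the embedding and routing steps as the cells to avoid.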
A. Comparison With Existing Techniques
We show the skew comparison under the linear delay model in Table II. GDME achieves zero skew since it builds the tree in an H shape. TMST achieves the best pathlength-delay skew in all comparisons. Table 4 shows the run-time comparison between the different algorithms. Because we perform detailed routing between each connected pair instead of using a rough Manhattan distance, and use a more accurate high-order delay model, TMST consumes considerably more run time than the other existing methods.
Table V compares our proposed TMST with the existing techniques GDME [3] and the thermal-aware clock synthesis methods PECO [9] and TACO [6] using SPICE. We scale from 100 down to 1 temperature maps. We present the skew violation distribution of r4 in Figures 7 and 8; note that TMST has not only a small worst-case skew but also a small skew variation. Figure 9 shows the clock tree built by our TMST method for the JPEG1 case. Significant bending in the routes between nodes can be seen.
VI. CONCLUSIONS AND FUTURE WORK
Existing clock tree and hybrid clock synthesis methods either did not consider the extra skew caused by temperature variation or, as in [8], needed to assume a time-invariant worst-case temperature map to find the worst-case skew. Those papers also use the Elmore delay model to evaluate clock skew. In this paper we have made the first attempt to reduce skew variation under time-variant temperature variation. We have developed a minimal-skew clock tree embedding that considers temporal and spatial correlation. Thermal-aware routing topology optimization is used to avoid hotspots and compensate for temperature variation. With the entire tree in the smoother temperature areas, our tree synthesis can significantly reduce the skew variation and the worst-case skew. Compared with the Greedy-DME (GDME) method of Edahiro and the existing thermal-aware clock synthesis methods TACO and PECO, our TMST solution reduces skew variation by 2X. As the number of temperature maps is scaled from 100 down to 1, TMST also guarantees the smallest wire-length overflow. TMST reduces the worst-case skew by up to 4X relative to PECO and 5X relative to TACO.
VII. ACKNOWLEDGMENTS
The authors gratefully acknowledge the contributions of Professor Andrew Kahng's research group at UCSD and Professor Lei He of UCLA, as well as the reviewers' comments.
