3D integration introduces new manufacturing and design challenges, such as timing corner mismatch between tiers and device variation due to Through Silicon Via (TSV) induced stress. Timing corner mismatch between tiers arises because each tier is manufactured in an independent process. Therefore, inter-die variation must be considered when analyzing and optimizing paths that span several tiers. TSV induced stress is another challenge in 3D Clock Tree Synthesis (CTS). Mobility variation of a clock buffer due to stress from a TSV can cause unexpected skew, which degrades overall chip performance. In this paper, we propose a clock tree design methodology with the following objectives: (a) to minimize clock period variation by assigning the optimal z-location of clock buffers with an Integer Linear Program (ILP) formulation, and (b) to prevent unwanted skew induced by the stress. In the results, we show that our clock buffer tier assignment reduces clock period variation by up to 34.2%, and that most of the stress-induced skew can be removed by our stress-aware CTS. Overall, we show that the performance gain can be up to 5.7% with our robust 3D CTS.
INTRODUCTION
TSV has become a main focus for future SoC integration. Therefore, we need design methodologies for TSV-based 3D-ICs, especially for clock network design. There have been several works on CTS in 3D-ICs. BURITO [1] addresses buffered clock trees in two stacked dies, and the work in [2] clarifies the whole flow for 3D CTS in N stacked dies without buffer insertion. The paper in [3] proposes pre-bond testable CTS methods. However, the previous works have not considered new design challenges for 3D-ICs such as inter-die variation and TSV induced stress.
Process variation can be decomposed into three components [4]: wafer-to-wafer (inter-die) variation, intra-die variation and random variation. The main challenge of 3D design comes from the integration of tiers in different timing corners, which means that cells along a path can have totally different variation characteristics. In addition, cells in different tiers lose their spatial correlation. In other words, only cells placed in the same tier are spatially correlated in process variation. The paper in [5] proposes how to select tiers for 3D integration based on pre-bond measurement data in order to maximize parametric yield. In this paper, we propose a more aggressive clock network design that takes advantage of timing corner mismatch. After all the cells and signal TSVs are placed, we can adjust the clock buffer z-location to minimize the sum of covariance for better timing yield. We propose an ILP formulation to determine the clock buffer z-location for optimization of near-critical paths.
Another design for manufacturing (DFM) challenge of 3D-ICs comes from the difference in Coefficients of Thermal Expansion (CTE) [6] [7] [8]. Because the CTE of copper is larger than that of silicon, tensile stress appears in the silicon near TSVs after cooling down to room temperature. The stress can change clock buffer driving capability due to mobility variation. Since PMOS is more sensitive to silicon stress [6], the rising delay is affected more by the stress, which means that a clocking scheme using positive edge triggered flip-flops is more susceptible to TSV induced stress. In this paper, we propose a buffer delay model for the stress and a stress-aware clock network design.
Initially, we generate an abstract tree. Since the abstract tree does not specify where clock buffers are inserted, we cannot determine the z-location of a clock buffer before the buffer insertion points are determined. To resolve this problem, we use an iterative bottom-up tree construction approach from sink to source. At leaf nodes, we identify whether buffer insertion is required, and then determine the z-location with our ILP formulation, which minimizes clock period variation. We fix the z-locations of buffers already determined at previous steps. At the next level, we find buffer insertion points and determine the z-location of buffers for those nodes. Iteratively, the z-locations of clock buffers are determined to optimize timing yield until the clock source is reached. Meanwhile, buffer delay under the stress is calculated and considered.
The contributions of this paper include the following:
• To the best of our knowledge, this is the first work to show that clock buffer tier assignment can reduce clock period variation.
• With our clock buffer assignment, we show that the standard deviation of the clock period can be reduced by up to 34.2%. Thus, we can increase chip operating frequency for the same timing yield, or we can increase timing yield for the same operating frequency.
• This is also the first work to show that TSVs can cause unwanted stress on clock buffers and to propose a buffer delay variation model that considers the stress during CTS.
The rest of the paper is organized as follows. We will show related work and motivation in section 2. We will propose our robust clock tree construction in section 3. Experimental results will be shown in section 4, and we will conclude in section 5.
RELATED WORK AND MOTIVATION
Clock period (CP) under process variation is determined by equation 1 at the 3σ level.
Here, the mean of the clock period is determined by equation 2. T_CtQ and T_setup are the clock-to-Q propagation delay and the setup time of a flip-flop, respectively. Combinational logic delay is denoted by T_logic. T_skew is the clock skew of the clock network.
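Equations 1 and 2 do not appear in this text. A form consistent with the definitions above is sketched here; it is an assumption inferred from the surrounding description (3σ level, four timing components), not the original typesetting:

```latex
\begin{align}
CP &= \mu_{cp} + 3\,\sigma_{cp} \tag{1}\\
\mu_{cp} &= T_{CtQ} + T_{logic} + T_{setup} + T_{skew} \tag{2}
\end{align}
```

In this form, lowering either σ_cp or μ_cp shortens the clock period achievable at a given timing yield.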
There are two ways to improve chip performance and thereby enhance timing yield during CTS. First, we can minimize σ_cp. In this paper, we show that σ_cp reduction can be achieved during CTS for 3D-ICs. The second way is to minimize μ_cp, which we achieve by reducing T_skew in 3D CTS through consideration of TSV induced stress.
978-1-4244-7516-2/11/$26.00 ©2011 IEEE 7C-1
Clock buffer tier assignment
Fig. 1 shows a clock path spanning two dies. We define clock buffers connected to the F/Fs at path inputs as type-A buffers. Similarly, clock buffers connected to the F/Fs at path outputs are defined as type-B buffers. In Fig. 1, buffer A is type-A, and buffer B is type-B. Let F1 and B be placed in die0, and L1 and F2 in die1. If the z-location of clock buffer A is flexible, we can assign clock buffer A to either die0 or die1. Intuitively, we can assign A to die0 to avoid TSV insertion between A and F1. However, we have to consider the covariance between A and other cells when determining the z-location of the buffer.
For the path p in Fig. 1, μ_cp is determined by equation 3 if the propagation delay from the clock source to buffer A is the same as the delay from the clock source to buffer B. Here, E(F1) stands for the mean clock-to-Q delay of F1, and E(F2) is the mean setup time of F2. The mean delay of each cell is denoted by E(cell).
The variance of CP for the path p is determined by equation 4. The variance of each cell delay is denoted by Var(cell). σ_cp²(p) is the sum of the variance of each gate and the covariance between each pair of cells. Since two cells in different dies lose their correlation, their covariance terms become zero. After cell placement, we can still choose the clock buffer z-location to minimize the sum of covariance, which reduces σ_cp and enhances operating frequency and timing yield in 3D-ICs.
In our example, let each cell have the same variance and covariance, denoted by VAR and COV, respectively. If buffer A is placed on die1, Cov(A, F1), Cov(A, B), Cov(F1, L1), Cov(F1, F2), Cov(L1, B) and Cov(F2, B) become zero because the cells lose their correlation. Similarly, we can obtain the sum of covariance when buffer A is assigned to die0 in equation 5. We can minimize clock period variation by putting buffer A into die0.
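The die0 versus die1 comparison can be checked with a small enumeration. The sketch below assumes that the type-B buffer B enters the clock period with a negative sign, so any covariance term pairing B with another cell is negative, and that all cells share the same VAR and COV. The function name `signed_cov_sum` and the dictionaries are illustrative only and are not part of the original flow.

```python
from itertools import combinations

def signed_cov_sum(die_of, sign):
    """Sum of sign[i] * sign[j] over all cell pairs that share a die.

    Pairs on different dies contribute nothing because their
    covariance is taken to be zero.
    """
    total = 0
    for i, j in combinations(die_of, 2):
        if die_of[i] == die_of[j]:
            total += sign[i] * sign[j]
    return total

# Placement from the Fig. 1 example: F1 and B on die0, L1 and F2 on die1.
fixed = {"F1": 0, "B": 0, "L1": 1, "F2": 1}
# Assumption: only the type-B buffer carries a negative sign.
sign = {"A": +1, "F1": +1, "L1": +1, "F2": +1, "B": -1}

for die in (0, 1):
    placement = {"A": die, **fixed}
    print(f"A on die{die}: signed covariance count =",
          signed_cov_sum(placement, sign))
```

Under these assumptions the variance of CP is 5·VAR + 2·COV·s, where s is the returned count. Placing A on die0 gives s = 0, while placing it on die1 gives s = 2, which agrees with the conclusion that die0 is the better choice for buffer A.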
[Equation 5: σ_cp² of path p with buffer A on die0 and with buffer A on die1]
Strained silicon has been used to enhance the I_on of a transistor [9]. However, in 3D-IC manufacturing, unwanted stress is caused by the CTE mismatch between the copper TSV and silicon, as shown in Fig. 2. Investigations [10] show that at 200°C an anneal time of 30-60 minutes is required to achieve reasonable copper properties. Since the CTE of copper is larger than that of silicon, copper shrinks more than silicon as the chip cools after annealing. Several papers simulate TSV induced stress [7, 8] using finite element analysis (FEA). They show that a TSV can cause tensile stress of more than 200MPa. Systematic clock buffer variation due to TSV stress should therefore be considered for clock tree construction in 3D-ICs.
ROBUST CLOCK TREE DESIGN
In Fig. 3, we propose a 3D CTS flow to deal with the new challenges presented in section 2. The first step is to generate an initial abstract tree with minimum wire-length using the 3D-MMM algorithm [1]. The 3D-MMM algorithm constructs a 3D abstract tree, deciding the z-location of merging points in a recursive top-down manner. We assign the clock TSVs under a given TSV upper bound, and determine the hierarchical connection among the clock sinks, internal nodes and clock TSVs. The abstract tree has only merging point and child node information. In other words, after abstract tree generation, we do not know where clock buffers are inserted, so we cannot decide the z-location of a clock buffer. However, to determine buffer insertion, we need to know the TSV insertion points and buffer z-locations in order to calculate downstream capacitance. To break this circular dependency, we propose a level-by-level buffered clock tree construction approach from sink to source, as illustrated in Fig. 4.
First, as shown in Fig. 4(a), we identify buffer insertion points where the downstream capacitance is larger than the allowed maximum capacitance. Then, in Fig. 4(b), we determine the z-location of buffers in order to minimize covariance terms with the ILP formulation in section 3.1. After buffer z-location determination, we need to adjust the z-location of the merging point of the upstream tree in order to minimize TSV insertion, as in Fig. 4(c). On the next level of the abstract tree, the same procedures are executed, as in Fig. 4(d). Once the z-location is determined for a clock buffer, we determine the x and y location of the buffer. After that, buffer variation due to TSV stress is calculated, and wire-length is calculated to remove skew, as described in section 3.2.
σCP minimization for critical paths
From the observation in section 2.1, our goal is to minimize the sum of covariance by assigning clock buffer z-locations optimally.
Minimize \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \alpha_{i,j} \, Cov_{i,j}    (6)
Our problem is defined in formulation 6. Every pair of two cells in a clock path has a covariance value denoted by αi,j. M is the number of instances including clock buffers, flip-flops and logic gates in a clock path.
Cov_i,j expresses whether the covariance between two cells is retained, as given by the boolean equation 7. If the z-location of cell i is the same as that of cell j, Cov_i,j becomes one. Otherwise, Cov_i,j becomes zero, which means that there is no spatial correlation between the two cells. D_i,n is a binary variable used to indicate the z-location of cell i. For example, if D_i,0 is one, cell i is placed on die0. N is the number of tiers to be stacked for 3D integration.
By combining formulations 6 and 7, we obtain the ILP formulation 8, which minimizes covariance for the most critical path. Y_i,j,k are temporary binary variables introduced to convert the AND operation (D_i,k D_j,k) into ILP form. If the z-locations of two cells are already determined during 3D placement, we can skip the pair in formulation 8 and save runtime when solving the ILP.
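The conversion of the AND operation into linear constraints can be illustrated with the standard three-inequality encoding. The sketch below assumes this standard encoding (y ≤ d_i, y ≤ d_j, y ≥ d_i + d_j − 1); the exact constraints of formulation 8 are not shown in this text, and the function names are hypothetical.

```python
from itertools import product

def satisfies_and_constraints(d_i, d_j, y):
    """Check the three linear inequalities that force y = d_i AND d_j
    when all three variables are binary."""
    return y <= d_i and y <= d_j and y >= d_i + d_j - 1

def feasible_y(d_i, d_j):
    """Binary values of y allowed by the constraints for given d_i, d_j."""
    return [y for y in (0, 1) if satisfies_and_constraints(d_i, d_j, y)]

def cov_indicator(D_i, D_j):
    """Cov_{i,j} for one-hot tier vectors D_i and D_j:
    1 if both cells are on the same die, 0 otherwise."""
    return sum(a & b for a, b in zip(D_i, D_j))

# The only feasible y for each input pair is exactly the logical AND.
for d_i, d_j in product((0, 1), repeat=2):
    print(d_i, d_j, feasible_y(d_i, d_j))
```

Because each cell is on exactly one die, summing the per-tier AND terms over k gives a value of 0 or 1, which is why `cov_indicator` can be written as a plain sum.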
Clock buffers can be connected to multiple clock sinks. If the buffer z-location determined from one path differs from the z-location determined from another clock path, the two optimizations conflict.
In addition, we need to prevent insertion of multiple TSVs between consecutive clock buffers. For example, a parent buffer can be assigned to die3 when a child buffer is already fixed to die1. In that case, two TSVs are required. To avoid the hopping problem, we restrict z-location for parent buffer i from t − 1 to t + 1 when pre-determined child buffer is on die t as shown in formulation 9.
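The restriction on the parent buffer tier can be sketched as a small helper. The clipping to the bottom and top of the stack is an assumption added here for completeness, and the name `allowed_parent_tiers` is hypothetical.

```python
def allowed_parent_tiers(child_tier, num_tiers):
    """Tiers a parent buffer may take when its child buffer is fixed on
    child_tier: one tier below, the same tier, or one tier above,
    clipped to the valid range 0 .. num_tiers - 1."""
    low = max(0, child_tier - 1)
    high = min(num_tiers - 1, child_tier + 1)
    return list(range(low, high + 1))

# With a child on die1 of a 4-tier stack, die3 is excluded, so at most
# one TSV is needed between the parent and the child.
print(allowed_parent_tiers(1, 4))
```

In an ILP, the same restriction amounts to fixing the tier variables of the parent buffer to zero for every tier outside this list.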
We extend the ILP formulation to optimize multiple critical paths in formulation 10. L is the number of paths targeted by our optimization problem. M is the number of instances, including clock buffers, flip-flops and logic gates, in clock path p. t_i and t_j are the child node z-locations for clock buffers i and j, respectively. The formulation aims to minimize delay variation for the selected critical paths.
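The multi-path objective can be illustrated on a toy instance. The sketch below replaces the ILP solver with exhaustive enumeration, which is only practical for a handful of free buffers; it is not the formulation solved in the experiments. The cell names, the second path, and the unit-magnitude α values (negative when exactly one cell of the pair is the type-B buffer) are assumptions for illustration.

```python
from itertools import combinations, product

def path_cost(path, tier, alpha):
    """Sum of alpha over the cell pairs of one path that share a tier."""
    return sum(alpha[frozenset((i, j))]
               for i, j in combinations(path, 2)
               if tier[i] == tier[j])

def assign_tiers(paths, fixed, free, allowed, alpha):
    """Enumerate the tier choices of the free buffers and return the
    assignment that minimizes the cost summed over all target paths."""
    best, best_cost = None, float("inf")
    for choice in product(*(allowed[b] for b in free)):
        tier = dict(fixed, **dict(zip(free, choice)))
        cost = sum(path_cost(p, tier, alpha) for p in paths)
        if cost < best_cost:
            best, best_cost = dict(zip(free, choice)), cost
    return best, best_cost

# Two paths sharing launch buffer A and capture buffer B (hypothetical).
sign = {c: 1 for c in ("A", "F1", "L1", "F2", "F3", "L2", "F4")}
sign["B"] = -1  # assumption: type-B buffer contributes with a minus sign
alpha = {frozenset((i, j)): sign[i] * sign[j]
         for i, j in combinations(sign, 2)}
paths = [("A", "F1", "L1", "F2", "B"), ("A", "F3", "L2", "F4", "B")]
fixed = {"F1": 0, "B": 0, "L1": 1, "F2": 1, "F3": 1, "L2": 1, "F4": 1}

best, cost = assign_tiers(paths, fixed, ["A"], {"A": [0, 1]}, alpha)
print(best, cost)
```

Summing the cost over all target paths in one objective is what removes the conflict between per-path decisions for a shared buffer: the buffer receives a single tier that is best for the paths taken together.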
We use the spatial correlation model in [11] to account for the distance factor of spatial correlation, as shown in equation 11. Let the covariance between two cells i and j be Cov(i, j). We can characterize Cov(i, j) from Hspice measurement. ρ_i,j is the distance factor, which represents the reduction of spatial correlation as the distance between two cells increases. x_i,j is the geometrical distance between the two cells. If x_i,j is smaller than X_L, ρ_i,j decreases as x_i,j increases. When x_i,j reaches X_L, ρ_i,j becomes ρ_min. The proposed formulation can insert many TSVs between clock buffers, as shown in Fig. 5(a). In order to control the number of TSVs, we introduce a new parameter β_i,j in equation 11. By increasing β_i,j, we decrease α_i,j and thereby raise the possibility of assigning clock buffers i and j to the same die, which reduces the number of inserted TSVs. We can explore the optimal β_i,j value to minimize clock period variance for a specific number of inserted TSVs. α_i,j has a minus sign only if one clock buffer is type-B, as defined in section 2.1, because the variation of a type-B buffer can compensate the overall clock period variation.
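The behaviour of the distance factor can be sketched as follows. The text only states that ρ decreases with distance and saturates at ρ_min once the distance reaches X_L; the linear decay from 1 at zero distance is an assumption made here for illustration, and the function name is hypothetical.

```python
def distance_factor(x, x_l, rho_min):
    """Distance factor rho for two cells separated by distance x.

    Assumed shape: linear decay from 1.0 at x = 0 down to rho_min at
    x = x_l, and constant rho_min for any larger distance.
    """
    if x >= x_l:
        return rho_min
    return 1.0 - (1.0 - rho_min) * x / x_l

# Closer cells keep more of their correlation than distant ones.
for x in (0.0, 50.0, 100.0, 200.0):
    print(x, distance_factor(x, x_l=100.0, rho_min=0.2))
```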
Buffer variation modeling under TSV induced stress
Our stress induced variation modeling consists of three steps: 1) compact stress modeling, 2) a piezo-resistive model to calculate Δmobility, 3) buffer characterization by sweeping hole and electron mobility. Since FEA simulation takes several hours even for a single TSV, we use the analytical compact model in [7] and linear superposition for multiple TSVs [6] as a practical alternative. Then, we convert the stress to mobility variation with a piezo-resistive model. Since mobility variation due to stress depends not only on the applied stress strength but also on the orientation between the TSV and the transistor channel [12], we use the modified piezo-resistive model in equation 12. Here, Π is the tensor of piezo-resistive coefficients for holes and electrons [13], O_f(θ) is an orientation factor obtained from the empirical data in [12], and θ is the angle between the center of the TSV and the transistor channel.
Clock buffer delay is pre-characterized according to hole and electron mobility variation. Assuming rising edge triggered flip-flops, our concern about buffer delay variation can be narrowed to the rising delay only. In Table 1, we present the rising delay variation to show how much clock buffer delay can be changed by mobility variation. The work can be extended to falling edge triggered cases in a similar way. Table 1 shows that rising delay variation mainly depends on hole mobility variation, because the PMOS is used to charge the output capacitance during the rising transition. We use the NanGate library and the 45nm PTM model [14] to characterize the delay variation.
To show clock buffer variation under our modeling, we present rising delay contours based on the proposed modeling with four TSVs in Fig. 6. Fig. 6(a) shows the TSV induced stress contour. The radius of the TSVs is 2um and the Keep-Out-Zone (KOZ), denoted by the gray cylindrical shape, is 1um. Stress due to the TSV is approximately 150MPa outside the KOZ. Fig. 6(b) and (c) show the electron mobility and hole mobility variation contours, respectively. Since hole mobility can be either enhanced or degraded depending on the relative orientation between a TSV and a transistor channel, we can see that hole mobility is more susceptible to the stress than electron mobility. Finally, Fig. 6(d) shows the buffer delay variation contour for the rising transition. As expected, the rising buffer delay depends strongly on hole mobility variation. In the four-TSV case, we observe a clock buffer delay variation range of approximately 10%, from -3% to +7%. Therefore, TSV stress can lead to excessive skew if we do not take the TSV induced stress effect into account during CTS.
3D buffered clock tree synthesis (CTS)
The major difference between 2D and 3D clock trees comes from TSVs. TSVs not only add much larger capacitances, which cause more buffer insertion than in a 2D clock tree, but also apply stress to nearby clock buffers and change the effective resistance of those buffers. Since TSVs may lead to manufacturability problems as well, it is desirable to reduce the number of TSVs during 3D CTS, in addition to the fundamental goal of a 2D clock tree: zero skew with minimum wire-length. The 3D CTS is done in a bottom-up manner in this work. We assume that TSVs for logic paths are already fixed, and that TSVs for clock trees can be located arbitrarily as long as they do not overlap with other TSVs or cells.
Abstract Tree Generation: As briefly explained in section 3.1, we use the 3D-MMM algorithm [1] to obtain the abstract tree from the given sink locations under the given TSV upper bound. After this step, the z-location of each merging point (MP) is determined.
For every depth, in a bottom-up manner, do the following:
a) Identify candidates for buffer insertion, if the child node capacitance exceeds the predefined capacitance.
b) Determine the z-location of the buffer, using the ILP formulation to minimize covariance. The ILP formulation uses the clock tree information constructed so far, together with logical path information, to find the optimal z-location of the newly inserted buffer. If the z-location of the buffer determined by the ILP formulation is different from the z-location of the child node, a TSV is inserted between the child node and the buffer. If the buffers on the two edges are assigned to the same tier and the MP is not, we change the MP tier to the buffer tier in order to reduce the number of TSVs.
c) Determine the (x, y) location of clock buffers. To get the delay variation of a buffer due to TSV stress, we need to fix the buffer and TSV locations. For simplicity, we assume that an additional TSV due to step b), if needed, is located immediately after the child node. To determine the buffer location, we calculate the maximum allowed wire-length from the child node to the buffer that guarantees a small enough capacitance. Fig. 7 shows the wire, TSV, and buffer models used to calculate downstream capacitance and downstream delay. The buffer's (x, y) location is the point that does not overlap with the (TSV + KOZ) region, lies on the line connecting the two child nodes, and is within the maximum allowed wire-length from a child node.
d) Get the wire-length of each edge. Based on the downstream capacitance and downstream delay of the left and right child nodes, we calculate the wire-length from a child node to the merging point that meets zero skew. Since we already know the exact (x, y, z) location of each child node, we also have the minimum wire-length between the two child nodes based on the half-perimeter model. As shown in Fig. 8(a), we need to search for the location of the merging point on a 1-dimensional coordinate, from zero (child1) to totalWL (child2), where totalWL is the minimum wire-length between the two child nodes given by equation 13.
We use binary search to get the wire-length of each edge. To be more specific, as depicted in Fig. 8(a), from the current reference point for the merging point, point1 = γ, we calculate the skew at point2 = (γ + d_l) and at point3 = (γ - d_l), where d_l is the unit length to move. If the skew at point2 is the minimum among the three, we move the reference point to the right side, and if point3 has the minimum skew, the next reference point is on the left side. The location of the reference point, γ, is determined using equation 14, where i indicates the iteration index of the binary search. When the skew at a certain point is smaller than the skew tolerance, the calculation of the wire-length from the child node to the merging point is finished. In this paper, we use a maximum of 15 binary search iterations, which guarantees 3nm resolution for a 100um wire-length. Elongation of the wire is needed when the skew at the left or right child node is the minimum along the whole wire and is still larger than the skew tolerance. In such a case, we calculate the wire-length to be elongated as explained in [15].
e) Determine the (x, y) location of the merging point and TSVs. The merging point can be placed somewhere between the two child nodes in the x-y plane. We decide the (x, y) location of the merging point and TSVs based on the ratio of the wire-lengths of the left and right edges, as described in Fig. 8(b). The (x, y) location of the merging point can be expressed as equation 15. The (x, y) location of a TSV is determined in the same manner, because TSVs are evenly distributed along the edge. For example, in Fig. 8(b), the TSV for child1 is located midway between child1 and the MP.
f) Calculate the stress-induced buffer resistance and refine the wire-length to compensate for it. With the stress map, we can adjust the buffer delay at the current buffer location. Delay variation is directly interpreted as buffer resistance variation, so the buffer resistance under the stress map can be calculated as well. We then revisit step e) with the updated buffer resistance to compensate for the change. Note that this time all TSV locations are kept fixed at their previous locations to preserve the same stress effect; only the wire-length is adjusted, and the (x, y) location of the merging point changes accordingly.
Starting from the bottom of the clock tree and performing steps a) to f) level by level, a buffered 3D clock tree with N dies can be constructed with minimum wire-length and with skew under the skew tolerance of the system.
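The binary search of step d) and the ratio-based placement of step e) can be sketched as follows. This is a minimal illustration under stated assumptions: a simplified Elmore delay with hypothetical unit wire resistance `r` and capacitance `c`, no TSV or buffer model, a step size of totalWL / 2^(i+1) as an assumed form of equation 14, and linear interpolation between the child nodes as an assumed form of equation 15. All names and numeric values are illustrative.

```python
def elmore_delay(length, load_cap, down_delay, r=0.1, c=0.2):
    """Simplified Elmore delay of a wire of given length driving
    load_cap, plus the delay already accumulated below the child."""
    return r * length * (c * length / 2.0 + load_cap) + down_delay

def find_merge_point(total_wl, child1, child2, max_iter=15, tol=1e-3):
    """Search gamma in [0, total_wl], the wire-length from child1 to the
    merging point, that balances the two delays.
    Each child is a (load_cap, down_delay) tuple."""
    def skew(g):
        return abs(elmore_delay(g, *child1)
                   - elmore_delay(total_wl - g, *child2))

    gamma = total_wl / 2.0
    for i in range(1, max_iter + 1):
        if skew(gamma) <= tol:
            break
        step = total_wl / 2 ** (i + 1)  # assumed form of equation 14
        # Keep the point with the smallest skew among the three candidates.
        gamma = min((gamma, gamma + step, gamma - step), key=skew)
    return gamma, skew(gamma)

def merge_xy(xy1, xy2, wl1, wl2):
    """Place the merging point between the two children according to
    the ratio of the edge wire-lengths (assumed form of equation 15)."""
    ratio = wl1 / (wl1 + wl2)
    return (xy1[0] + ratio * (xy2[0] - xy1[0]),
            xy1[1] + ratio * (xy2[1] - xy1[1]))

# child2 already carries 5 extra delay units, so the merging point
# moves slightly toward child2 to rebalance the two branches.
gamma, residual = find_merge_point(100.0, (10.0, 0.0), (10.0, 5.0))
print(gamma, residual)
print(merge_xy((0.0, 0.0), (100.0, 50.0), gamma, 100.0 - gamma))
```

With 15 iterations the final step is totalWL / 2^16 and the search grid is finer than totalWL / 2^15, which for a 100um wire is about 3.05nm and matches the resolution quoted in step d).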
EXPERIMENTAL RESULTS
We implement the proposed CTS flow in C++, and use the NanGate library and the 45nm PTM model [14] to characterize variance and covariance, assuming 5% inter-die and 5% intra-die variation. Gurobi [16] is used as the ILP solver. Table 2 shows the circuit information used for our experiments. We use the same number of clock sinks and the same TSV density for all benchmarks to focus on the trend across different numbers of stacked tiers. # T.P. means the number of paths targeted by the optimization. For example, if we choose # T.P.=1, our algorithm tries to optimize the most critical path. TSV density is the percentage of area occupied by TSVs. The TSV diameter is 4um and the KOZ is 1um. We assume that the TSV capacitance is 28fF and the resistance is 0.053Ω.
First, we show that our work can provide a design guideline to reduce the stress effect on clock skew. Table 3 shows the skew caused by TSV stress according to the clock source z-location. To see the stress induced skew change, we perform zero-skew CTS without stress consideration, and then measure the skew with the stress model. Since the bottom tier in 3D stacking does not need TSVs in its silicon substrate, a clock buffer in Tier 0 is not affected by the stress. If the clock source is in Tier 0 (bottom tier), clock buffers tend to be concentrated in Tier 0, which reduces the skew variation due to the stress. However, we see a large increase of the skew (62.9ps) when the clock source is placed in Tier 1. For the remaining experiments, we assume that clock sources are placed in Tier 0 to show conservative results.
Second, we verify the usefulness of our stress-aware CTS. Table 4 compares case1 and case2 to show the skew variation for all of the benchmarks. Case1 is CTS without covariance optimization and without stress consideration, while case2 is stress-aware CTS without covariance optimization. In the table, Cov. means the average covariance for the optimized paths, and σ stands for the standard deviation of CP. Covariance and σ are average values over all targeted paths. The comparison shows that the skew due to the stress can be up to 12.8ps for CKT 8 if we do not consider TSV stress variation during CTS. The clock period of CKT 8 can increase by 2.8%, from 454ps to 466.8ps. If the clock source is in Tier 1, the overall clock period can increase by more than 10%, according to Table 3. Table 4 shows that stress-aware CTS incurs no penalty in clock buffers, TSVs or wire-length.
Next, our variation reduction using the ILP formulation is verified in Table 5. CTS without stress consideration, case3 in Table 5, shows relatively large skew caused by TSV stress because our ILP formulation spreads clock buffers more evenly over the tiers. We use β = 0 to see the maximum variation reduction. β is the control parameter introduced in equation 11 to avoid inserting too many TSVs. Finally, by combining our ILP formulation and stress modeling, we can reduce the clock period for CKT 8 at the 3σ level by up to 5.7%, as seen by comparing case1 and case4.
CONCLUSIONS
For 3D-IC design, we observe two important design challenges: variation between tiers and TSV induced stress. The inter-die variation effect can be used to compensate clock path variation, which optimizes random variation. TSV induced stress is a systematic component of variation. We reduce the nominal value of the clock period by considering the stress during CTS, and minimize the variation of the clock period with optimal assignment of clock buffer z-locations. The proposed 3D CTS can enhance maximum frequency by up to 5.7% by combining the two approaches.
