# PAPER Variant X-Tree Clock Distribution Network and Its Performance Evaluations<sup>\*\*</sup>

# Xu ZHANG<sup> $\dagger$ \*</sup>, Xiaohong JIANG<sup> $\dagger$ </sup>, Nonmembers, and Susumu HORIGUCHI<sup> $\dagger$ </sup>, Member

SUMMARY The evolution of VLSI chips towards larger die size, smaller feature size and faster clock speed makes the clock distribution an increasingly important issue. In this paper, we propose a new clock distribution network (CDN), namely Variant X-Tree, based on the idea of X-Architecture proposed recently for efficient wiring within VLSI chips. The Variant X-Tree CDN keeps the nice properties of equal-clock-path and symmetric structure of the typical H-Tree CDN, but results in both a lower maximal clock delay and a lower clock skew than its H-Tree counterpart, as verified by an extensive simulation study that simultaneously incorporates the effects of process variations and on-chip inductance. We also propose closed-form statistical models for evaluating the skew and delay of the Variant X-Tree CDN. The comparison between the theoretical results and the simulation results indicates that the proposed statistical models can be used to efficiently and rapidly evaluate the performance of Variant X-Tree CDNs.

key words: Clock Distribution Network (CDN), Variant X-Tree, X Architecture, H-Tree, clock skew

#### 1. Introduction

Clock signals that operate at the highest speed of any signals within a VLSI chip play a central role in the design of modern digital synchronous systems. The evolution of VLSI chips towards larger die size, smaller feature size and faster clock speed makes the clock distribution an increasingly important issue [10]. The clock distribution networks (CDNs), which are used to distribute clock signals to synchronize the data flows among different data paths, can significantly affect the overall system performance and reliability.

To evaluate the performance of a CDN, we usually need to study its maximum clock delay and clock skew, where the clock skew is defined as the difference between the maximum clock delay and the minimum clock delay among all clock paths (interconnects) in the CDN. The clock skew arises mainly from unbalanced delays due to the unequal clock path lengths between the clock source and different modules, as well as from process variations that cause clock path delay variations [10].

<sup>†</sup>The authors are with the Graduate School of Information Science, Tohoku University, Sendai, 980-8579 Japan

\*zhxu@ecei.tohoku.ac.jp

The critical issues in the design of a clock distribution network are to achieve a low clock delay and, in most cases, a minimum or useful skew with the minimum buffer size and wire length. The well-balanced H-Tree CDN has been widely adopted to eliminate the skew caused by unequal clock path lengths [9]; in such a network, the uncontrollable clock skew mainly comes from the variations in process parameters that affect the interconnect impedance/capacitance and, in particular, any distributed buffers or amplifiers [3], [9]. Extensive research efforts have been devoted to studying and modeling the impacts of process variations upon the clock skew of a CDN; see, for example, [9], [12], [13].

Although H-Tree is attractive for clock distribution due to its small clock skew and a relatively simple implementation, it usually results in a long clock path from the clock source to each sink (clock terminal). Thus, an H-Tree CDN usually causes a higher clock delay.

Mesh or grid is also a popular architecture for distributing clock signals on a chip. It uses the inherently redundant interconnects created by loops to smooth out undesirable variations between signal nodes spatially distributed over the chip, and thus results in a lower clock skew. However, the mesh/grid CDN usually occupies a larger wiring area and consumes more power. This situation is becoming worse as the area of modern VLSI chips increases.

Recently, the X Architecture was proposed to wire a VLSI chip with considerably shorter wiring length than that of the traditional Manhattan wiring architecture [2]. It has been demonstrated in [18] that the X Architecture, which supports 45- and 135-degree wires as well as vertical and horizontal wires, can reduce the wire length required by the simple Manhattan wiring architecture by as much as 29%. As a result, the X Architecture is promising for considerably reducing the delay and improving the overall performance of on-chip interconnects.

In this paper, we extend the X Architecture to clock distribution and propose a novel non-orthogonal clock distribution network, namely Variant X-Tree, which preserves the nice properties of equal-clock-path and symmetric structure of the typical H-Tree CDN. We analyze the detailed layout and construction features of the Variant X-Tree CDN. The simulation results show that the Variant X-Tree CDN is able to achieve both a lower maximal clock delay and a lower clock skew than its H-Tree counterpart. We also introduce a statistical performance evaluation model that rapidly estimates its performance by statistical analysis while taking process variations into account. Experimental results indicate that the proposed model serves effectively as an upper bound on the skew and maximal clock delay. Moreover, because a set of closed-form equations is derived from this model, it can be conveniently integrated into the design flow.

Manuscript received January 1, 2003.

Manuscript revised January 1, 2003.

Final manuscript received January 1, 2003.

<sup>\*\*</sup>This work is partly supported by Tohoku University 21st century COE program. The partial results of this work have been published in [20].

The rest of this paper is organized as follows. Some preliminaries about the X Architecture are introduced in Section 2. The layout and construction features of the Variant X-Tree CDN are presented in Section 3, and the statistical performance model is described in Section 4. The experimental methodology of performance evaluation is introduced in Section 5. Section 6 provides the simulation results and discussions, and finally, Section 7 concludes this paper.

#### 2. X Architecture

Within traditional VLSI chips, interconnects have been routed using so-called Manhattan architecture, namely, only the vertical wires and horizontal wires are permitted in a chip. The X Architecture that belongs to non-Manhattan architecture [14] was proposed recently for efficient integrated circuit wiring based on the pervasive use of diagonal wires [2]. The X Architecture applies to the chips with at least five layers of metal, as illustrated in Fig.1(a), where its fourth and fifth metal layers (i.e., M4 and M5) are rotated by 45- and 135-degrees, respectively, and its M1 to M3 layers remain Manhattan and can be implemented by using available standard cells and hard IP in conjunction with the diagonal wiring.

The concept of the X Architecture is simple and its benefits are clear. Compared to the traditional Manhattan architecture, it can shorten wiring across a die by up to 17% in the average case (the theoretical maximum reduction is 29%) [6]. The X Architecture is becoming popular: according to [2], some VLSI chips based on it have been released (e.g., a GPU chip by ATI in 2005 and a 10Gb Ethernet chip by Teranetics in 2006). Design rules and EDA tools that support the X Architecture are also available.

Notice that the above simple preferred-direction implementation of the X Architecture will likely increase the number of vias and cause extra interconnect delay despite the reduction in overall wire length. It was suggested recently to combine the techniques of X-aware placement and liquid routing (i.e., wiring with all eight compass directions in all layers) to take full advantage of the X Architecture [2].



Fig. 1 (a) Preferred direction in X Architecture (b) Non-preferred direction support in M4 and M5

#### 3. Variant X-Tree CDN

Since the X Architecture is promising in reducing the wire length and thus the interconnect delay, we expect that the performance of clock distribution can be improved if the X Architecture is applied to the design of a clock distribution network.

To support our new CDN, we assume in this paper that wiring in all eight directions (including both preferred and non-preferred directions) is permitted on a VLSI chip. An example in which non-preferred directions are also supported in M4 and M5 is illustrated in Fig.1(b).

In this section, we first introduce the basic unit of Variant X-Tree based on the X Architecture. We then study the detailed layout features of the Variant X-Tree CDN.

#### 3.1 Basic Unit of Variant X-Tree CDN

We apply the construction approach of the well-known H-Tree CDN to build a new CDN based on the X Architecture. It is notable, however, that we cannot directly apply X Architecture wiring to construct an X-Tree recursively in the same way as an H-Tree, because the overlap among the 45-degree and 135-degree wires increases sharply with the number of supported clock sinks. Here, we propose a scalable approach to constructing large-scale CDNs based on an extension of the X Architecture (we refer to this new CDN as Variant X-Tree CDN hereafter).



Fig. 2 The basic unit of Variant X-Tree

The main idea of the Variant X-Tree CDN is to first define a basic unit (the minimum unit) for clock distribution to 4 sinks, as portrayed in Fig.2. The parameter a in Fig.2 is half of the distance between two sinks and b is the horizontal offset distance relative to the sinks, where b must satisfy Inequality (1) to guarantee the connectivity between two adjacent network levels.

$$b \ge 2\sqrt{2} \cdot P_{min} \tag{1}$$

where  $P_{min}$  is the minimum permitted pitch of interconnects within a VLSI chip.

Based on the basic unit illustrated in Fig.2, we can construct a large-scale Variant X-Tree CDN recursively, in the same manner as an H-Tree CDN. Fig.3 shows a level-6 Variant X-Tree CDN, which distributes the clock signal to $2^6 = 64$ clock sinks.



Fig. 3 A Variant X-Tree CDN with 64 sinks (level=6)

According to Fig.2 and Fig.3, the length of the clock path from the clock source to each clock sink is clearly equal, as in an H-Tree CDN. The drawback of the Variant X-Tree CDN is that the distance between sinks is not always equal and the layout is only axis-symmetric. However, the isometry of the Variant X-Tree CDN can be achieved approximately when the distance between sinks (geometric parameter a) is large enough and b is set to its minimum value.

#### 3.2 Construction Features of Variant X-Tree CDN

It is necessary to study the detailed construction features in order to build a large-scale CDN based on the basic unit of Variant X-Tree. In this subsection, we describe some important layout properties of the Variant X-Tree CDN.

**Observation 1:** To generate a Variant X-Tree with a shorter clock path length than its H-Tree counterpart, the following inequality must be satisfied.

$$b < 2(2 - \sqrt{2}) \cdot a \tag{2}$$

*Proof.* According to the construction rules of the H-Tree CDN and the Variant X-Tree CDN, we can easily see that the clock path length $(L_{H-Tree}^{(n)})$ of an n-level H-Tree CDN $^\dagger$ is given by (3), while the clock path length $(L_{X-Tree}^{(n)})$ of an n-level Variant X-Tree CDN is given by (4).

$$L_{H-Tree}^{(n)} = (2^{1+n/2} - 2) \cdot a \tag{3}$$

$$L_{X-Tree}^{(n)} = \frac{1}{2}(2^{n/2} - 1) \cdot (2\sqrt{2}a + b) \tag{4}$$

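Obviously, to obtain a shorter clock path than that of H-Tree, we need

$$L_{X-Tree}^{(n)} < L_{H-Tree}^{(n)}$$

Substituting (3) and (4) and noting that $2^{1+n/2} - 2 = 2(2^{n/2} - 1)$, the common factor $(2^{n/2} - 1)$ cancels and the condition reduces to $2\sqrt{2}a + b < 4a$, which yields Inequality (2).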
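As a quick numerical check of Observation 1, the following minimal Python sketch evaluates Equations (3) and (4) together with the admissible range of b implied by Inequalities (1) and (2). The function names are hypothetical; the values a = 0.12 mm and b = 10 μm are those assumed later in Section 5.5, while the pitch value used for $P_{min}$ is an assumption made purely for illustration.

```python
import math


def h_tree_path_length(n, a):
    """Clock path length of an n-level H-Tree CDN, Equation (3)."""
    return (2 ** (1 + n / 2) - 2) * a


def x_tree_path_length(n, a, b):
    """Clock path length of an n-level Variant X-Tree CDN, Equation (4)."""
    return 0.5 * (2 ** (n / 2) - 1) * (2 * math.sqrt(2) * a + b)


def b_range(a, p_min):
    """Admissible interval for b: lower bound from (1), strict upper bound from (2)."""
    lower = 2 * math.sqrt(2) * p_min
    upper = 2 * (2 - math.sqrt(2)) * a
    return lower, upper


a_um, b_um = 120.0, 10.0  # a = 0.12 mm and b = 10 um, as in Section 5.5
p_min_um = 1.5            # illustrative minimum pitch only (assumption)

lower, upper = b_range(a_um, p_min_um)
print(f"admissible b: [{lower:.2f}, {upper:.2f}) um")
for n in (2, 6, 10):
    lx = x_tree_path_length(n, a_um, b_um)
    lh = h_tree_path_length(n, a_um)
    print(f"n={n}: L_X={lx:.1f} um, L_H={lh:.1f} um, ratio={lx / lh:.3f}")
```

Because the factor $(2^{n/2} - 1)$ is common to both expressions, the ratio $L_{X-Tree}^{(n)} / L_{H-Tree}^{(n)}$ equals $(2\sqrt{2}a + b)/(4a)$ for every n, which is about 0.73 for the values above.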

**Observation 2:** The construction levels that can be achieved recursively in one metal layer do not exceed 2 in Variant X-Tree CDN.

*Proof.* Consider that a two-level Variant X-Tree CDN has been constructed in a layer as portrayed in Fig.4, where $h_1$ is the distance between the branch connecting to the next level and the right-bottom branch of level-2, and $h_2$ is the distance between the left-bottom branch of level-1 and the right-bottom branch of level-2. For an n-level CDN, $h_1^{(n)}$ and $h_2^{(n)}$ can be computed from geometric constraints:

$$h_1^{(n)} = 2^{\frac{n}{2}} \cdot \frac{\sqrt{2}}{4}b \tag{5}$$

$$h_2^{(n)} = 2^{\frac{n}{2}}\sqrt{2}a + 2^{\frac{n}{2}} \cdot \frac{\sqrt{2}}{4}b \tag{6}$$

Clearly, $h_1^{(n)}$ is smaller than $h_2^{(n)}$. Therefore, an intersection results when a complete level-2 Variant X-Tree CDN connects to the next level in the same metal layer, which demonstrates that the next level of the Variant X-Tree cannot be routed in the same metal layer. Thus, at most 2 levels of a complete Variant X-Tree can be achieved in one layer. Additionally, Observation 2 can be verified intuitively from Fig.4.
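The comparison used in this proof follows directly from Equations (5) and (6). Subtracting (5) from (6) gives

$$h_2^{(n)} - h_1^{(n)} = 2^{\frac{n}{2}}\sqrt{2}\,a > 0 \quad \text{for any } a > 0,$$

so the gap between the two distances does not depend on b and grows with the level n.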

Observation 2 indicates that there are at most 16 clock sinks (terminals) in one metal layer, so we can determine the maximum number of supported clock sinks once the number of routing metal layers of the CDN is specified. For example, the number of sinks can be as large as 256 when the Variant X-Tree CDN is wired in 2 metal layers, 1024 for 3 metal layers and 4096 for 4 metal layers. We therefore consider this sufficient for most CDNs and suitable for semi-global or global clock distribution on a VLSI chip.

<sup>†</sup>H-Tree CDN can be considered as a Variant X-Tree CDN where the distance between clock sinks is $2 \cdot a$, b=0 and no diagonal wires are permitted. Note that the vertical distance between sinks in the Variant X-Tree CDN is also $2 \cdot a$.



Fig. 4 Wire intersection occurs in one layer when more levels are constructed

Furthermore, it is notable that we can connect two 2-level Variant X-Trees with horizontal/vertical wires rather than diagonal wires so as to obtain 64 terminals in one metal layer without wire intersection. In other words, we can construct a Variant X-Tree CDN with 512 clock sinks in two metal layers by connecting two level-6 CDNs using a horizontal/vertical wire. Moreover, with this idea, we can derive a hybrid CDN that combines the topologies of Variant X-Tree and H-Tree and maximizes the number of clock sinks of a CDN. Fig.5 shows such a CDN, which adds one H-Tree level after constructing two levels of Variant X-Tree CDN. Thus we have the following observation:



Fig. 5 A hybrid CDN with 2-level Variant X-Tree and 1-level H-Tree link

**Observation 3:** A hybrid CDN with 2-level Variant X-Tree and 1-level H-Tree link supports 64 terminals in one metal layer.

So we can construct a larger CDN with this hybrid Variant X+H Tree structure while keeping the number of routing metal layers small.

The rules mentioned above enable designers to choose a proper type of CDN depending upon the number of clock sinks and the available wiring metal layers.

#### 4. Statistical Performance Analysis Model

As CMOS technology advances into the nanometer feature size and multi-gigahertz regime, and with the adoption of Cu-based on-chip interconnects, the performance of VLSI circuits is becoming more sensitive to process variations. Process variations can significantly impact both device and interconnect performance and thereby affect the circuit performance (especially for clock distribution networks). Traditional corner-based analysis can be overly conservative [8], and statistical timing analysis has become important in the last few years. In this section, we propose a statistical performance evaluation approach for the Variant X-Tree CDN in order to give designers a guideline for estimating the performance of a CDN in the initial design stage.

#### 4.1 Statistical Performance Evaluation Model

The authors in [13] proposed a statistical skew model for general clock distribution networks in a bottom-up manner. In particular, a closed-form model of the clock skew and maximum clock delay is also presented for a well-balanced H-Tree CDN.

For the Variant X-Tree CDN (and even the hybrid Variant X+H Tree CDN proposed in this paper), the statistical performance model in [13] is also applicable because the Variant X-Tree CDN is essentially a binary clock tree and is well-balanced like the H-Tree. We thus adopt this statistical performance model to estimate the performance of the Variant X-Tree CDN. The model is summarized as follows.

Fig. 6 Illustration of basic unit of Variant X-Tree for statistical performance analysis

In Fig.6, $\xi$, $\eta$ and $\chi$ are defined as the maximum delay, the minimum delay and the clock skew at the intersection node of two branches, respectively, and d is the delay of the clock paths (branches of the CDN) that connect different nodes. Then, for an N-level Variant X-Tree, let $d_i$, $i = 0, \dots, N$, be the actual delay of branch *i* of a clock path. The mean values and the variances of the maximal clock delay, $\xi$, and the minimal clock delay, $\eta$, of the Variant X-Tree are then given by the following equations:

$$E(\xi) = \sum_{i=0}^{N} E(d_i) + \frac{1}{\sqrt{\pi}} \sum_{i=1}^{N} \sqrt{\sum_{k=1}^{i} \left(\frac{\pi - 1}{\pi}\right)^{k-1} \cdot D(d_{N-i+k})} \tag{7}$$

$$E(\eta) = \sum_{i=0}^{N} E(d_i) - \frac{1}{\sqrt{\pi}} \sum_{i=1}^{N} \sqrt{\sum_{k=1}^{i} \left(\frac{\pi - 1}{\pi}\right)^{k-1} \cdot D(d_{N-i+k})} \tag{8}$$

$$D(\xi) = D(\eta) = \sum_{i=0}^{N} \left(\frac{\pi - 1}{\pi}\right)^{i} \cdot D(d_i) \tag{9}$$

The expected clock skew  $E(\chi)$  and skew variance  $D(\chi)$  of the N level Variant X-Tree CDN are given by:

$$E(\chi) = E(\xi) - E(\eta) = \frac{2}{\sqrt{\pi}} \sum_{i=1}^{N} \sqrt{\sum_{k=1}^{i} \left(\frac{\pi - 1}{\pi}\right)^{k-1} \cdot D(d_{N-i+k})} \tag{10}$$

$$D(\chi) = D(\xi) + D(\eta) - 2 \cdot \rho \cdot \sqrt{D(\xi) \cdot D(\eta)} = 2 \cdot (1 - \rho) \cdot \sum_{i=0}^{N} \left(\frac{\pi - 1}{\pi}\right)^{i} \cdot D(d_i) \tag{11}$$

where $E(\cdot)$ and $D(\cdot)$ represent the mean value and the variance of a random variable, respectively, and $\rho$ is the correlation coefficient of $\xi$ and $\eta$, which can be evaluated recursively for a network. The closed-form expressions (7)-(11) indicate clearly how the clock skew and the maximal clock delay accumulate along the clock paths and with the increase of the Variant X-Tree CDN's size.
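To make the use of expressions (7)-(11) concrete, the following minimal Python sketch evaluates them for given branch delay statistics. The function name and the numerical inputs are hypothetical, and the correlation coefficient $\rho$ is simply passed in by the caller rather than evaluated recursively.

```python
import math


def variant_x_tree_stats(mean_d, var_d, rho):
    """Evaluate Equations (7)-(11).

    mean_d[i] and var_d[i] are E(d_i) and D(d_i) for branch i = 0..N;
    rho is the correlation coefficient of the maximum and minimum delay
    (supplied by the caller in this sketch).
    """
    n_levels = len(mean_d) - 1
    q = (math.pi - 1) / math.pi

    total_mean = sum(mean_d)
    spread = 0.0
    for i in range(1, n_levels + 1):
        inner = sum(q ** (k - 1) * var_d[n_levels - i + k] for k in range(1, i + 1))
        spread += math.sqrt(inner)
    spread /= math.sqrt(math.pi)

    e_max = total_mean + spread                                   # Eq. (7)
    e_min = total_mean - spread                                   # Eq. (8)
    var_extreme = sum(q ** i * var_d[i] for i in range(n_levels + 1))  # Eq. (9)
    e_skew = e_max - e_min                                        # Eq. (10)
    var_skew = 2 * (1 - rho) * var_extreme                        # Eq. (11)
    return {"E_max": e_max, "E_min": e_min, "D_extreme": var_extreme,
            "E_skew": e_skew, "D_skew": var_skew}


# Hypothetical inputs: N = 4, every branch with mean 10 ps and variance 1 ps^2.
stats = variant_x_tree_stats([10.0] * 5, [1.0] * 5, rho=0.5)
print(stats)
```

With all branch variances set to zero the sketch returns identical maximum and minimum delays and zero skew, as expected for a perfectly balanced tree without process variations.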

#### 4.2 Variance Estimation for a Branch in CDN

The delay variance of a clock path can be determined in terms of the variances of the underlying independent random variables. To make the problem tractable, we assume that the process variations of devices and interconnects are independent. Thus we can evaluate the delay variance as described in [13].

To analyze the delay in the presence of process variations rapidly, the authors in [4] proposed a statistical model that discards higher-order terms without resulting in a loss of accuracy. The proposed models enable closed-form computation of the means and variances of interconnect delay for given magnitudes of the relevant process variations. Therefore, we also adopt this model for efficient analysis of the means and variances of interconnect delay.

Note that considering the correlation of parameters can lead to a more accurate result, but this correlation is neglected here to simplify the computation. According to [5], [13], the correlation makes the path delays within the same chip positively dependent. Neglecting it guarantees that the expected values of the skew and maximal clock delay still serve as an upper bound. The experimental results in Section 6 show that the model remains effective in this case.

#### 5. Methodology of Performance Evaluation

To verify the performance improvement of the Variant X-Tree CDN and the proposed statistical performance model, we conducted simulations in the presence of process variations. In this section, we first introduce the considerations about process variations and the delay calculation method. We then describe the simulation setup and the parasitics extraction method related to delay calculations. Finally, we present the parameter settings.

#### 5.1 Process Variations

In the manufacturing process of a VLSI system, some uncertainties (process variations) may arise due to the parameter fluctuations of devices or the environment, which make the overall performance of the system vary with these inherent and unavoidable fluctuations. In general, the parameter fluctuation consists of inter-die parameter fluctuation and intra-die parameter fluctuation. The former is the result of lot-to-lot and wafer-to-wafer variations of parameters related to equipment properties, wafer polishing and wafer placement, and it usually affects every element on a chip equally. On the other hand, the intra-die parameter fluctuation, such as the resist thickness fluctuation across the wafer and the aberrations in the stepper lens, usually affects the elements of a chip unequally and produces a non-uniformity of electrical characteristics across the chip.

The process variations may affect both the geometry parameters of devices (e.g., inverters) and the geometry parameters of interconnects (such as length, width and thickness) in VLSI systems. In nanoscale or deep sub-micron (DSM) processes, the parameter fluctuations pose a growing threat to system performance, especially for gigascale interconnection systems where the polysilicon gate length has decreased below the wavelength of light used in the optical lithography process [7]. It is predicted that in a 130nm technology [17], the variation magnitude in the gate length of n-MOS and p-MOS can be as high as 35% (specified by the fraction $3\delta/\mu$, where $\delta$ and $\mu$ are the standard deviation and mean of the gate length, respectively).

In this paper, we consider both the interconnect parameters variation and device parameters variation in our analysis. For a parameter, its variation  $\sigma$  can be generally modeled as:

$$\sigma = \sigma_{Inter-die} + \sigma_{Intra-die,global} + \sigma_{Intra-die,local} + \varepsilon \tag{12}$$

where $\sigma_{Inter-die}$, $\sigma_{Intra-die,global}$, $\sigma_{Intra-die,local}$ are its inter-die variation, location-dependent global intra-die variation and local intra-die variation, respectively, and $\varepsilon$ is a random component.

#### 5.2 Interconnect Delay Calculation

As the interconnect length and operating speed have entered the nanoscale regime and gigascale regime, respectively, the inductance component has become comparable to the resistance component in VLSI circuits (especially for Cu-based interconnect technology with a low resistance) [11]. Thus, the more advanced RLC model should be adopted to fully analyze the real performance of modern CDNs.

In this paper, we calculate the delay of a clock path in a CDN with the distributed RLC model proposed in [11], in which an empirical RLC delay equation based on curve fitting was derived as Equation (13):

$$t_{50\%} = (e^{-2.9\zeta^{1.35}} + 1.48\zeta)/\varpi_n \tag{13}$$

where

$$\varpi_n = \frac{1}{\sqrt{L_{int}(C_{int} + C_L)}} \tag{14}$$

$$\zeta = \frac{R_{int}}{2} \sqrt{\frac{C_{int}}{L_{int}}} \frac{R_T + C_T + R_T C_T + 0.5}{\sqrt{1 + C_T}} \tag{15}$$

$$R_T = R_s / R_{\rm int} \tag{16}$$

$$C_T = C_L / C_{\rm int} \tag{17}$$

Here, $C_L$ is the load capacitance, $R_s$ is the output resistance of the driving inverter, and $R_{int}$, $C_{int}$, and $L_{int}$ are the total line resistance, capacitance, and inductance, respectively.

Similar to an H-Tree CDN, a Variant X-Tree CDN can also be represented by a binary tree, and the total delay of a clock path from the source to a sink can be calculated by summing the delays of all branches along this path. For a Variant X-Tree CDN, we suppose that an inverter is inserted into each branch to drive the downstream interconnect (see the example illustrated in Fig.3, where the inverters in different branches are not shown). Thus, we can apply Equation (13) for the RLC delay calculation.
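The delay calculation described above can be sketched in a few lines of Python. This is a minimal illustration of Equations (13)-(17) and of the branch-by-branch summation; the function names and the electrical values in the example are hypothetical, apart from the 0.1 pF load capacitance assumed in Section 5.5.

```python
import math


def rlc_delay_50(r_int, c_int, l_int, r_s, c_load):
    """50% delay of one driven RLC branch, Equations (13)-(17), in SI units."""
    r_t = r_s / r_int                                    # Eq. (16)
    c_t = c_load / c_int                                 # Eq. (17)
    omega_n = 1.0 / math.sqrt(l_int * (c_int + c_load))  # Eq. (14)
    zeta = ((r_int / 2.0) * math.sqrt(c_int / l_int)
            * (r_t + c_t + r_t * c_t + 0.5) / math.sqrt(1.0 + c_t))  # Eq. (15)
    return (math.exp(-2.9 * zeta ** 1.35) + 1.48 * zeta) / omega_n   # Eq. (13)


def path_delay(branches):
    """Total delay of a clock path as the sum of its branch delays."""
    return sum(rlc_delay_50(**branch) for branch in branches)


# Hypothetical branch: 50 ohm, 0.2 pF, 1 nH line driven through 30 ohm into 0.1 pF.
branch = {"r_int": 50.0, "c_int": 0.2e-12, "l_int": 1.0e-9,
          "r_s": 30.0, "c_load": 0.1e-12}
print(f"branch delay: {rlc_delay_50(**branch) * 1e12:.2f} ps")
print(f"3-branch path delay: {path_delay([branch] * 3) * 1e12:.2f} ps")
```

Passing each branch as a dictionary keeps the sketch close to the binary-tree view of the CDN: a clock path is simply the list of branches from the source to one sink.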

#### 5.3 Simulation Setup

To fully investigate the performance improvements of the Variant X-Tree CDN compared to the H-Tree CDN in the presence of process variations, we consider here two types of CDNs: a Variant X-Tree CDN (referred to as CDN-X hereafter) and an H-Tree CDN (abbreviated as CDN-H hereafter). Both are intended for global clock distribution and are used to distribute clock signals to different processor elements (PEs) in a SoC/NoC chip. For each clock signal wire in both CDN-X and CDN-H, a ground line and a power line are placed on either side of it, as illustrated in Fig.7.



Fig. 7 Routing structure of clock wire and P/G wires

Since the main target of this paper is to investigate the effects of process variations and inductance upon the CDNs, we assume in our simulation that the power grid is ideal (i.e., it is free of IR-drop, voltage fluctuation, etc.).

The main steps of the simulation are summarized as follows.

- (1) Determine the layout of the CDN (CDN-X or CDN-H).
- (2) Generate independent random data set that follows the Gaussian distribution.
- (3) Map the random data to fluctuations of physical dimensions of interconnects and device parameters.
- (4) Compute electrical parameters.
- (5) Compute the delay of each clock path in the clock distribution network.
- (6) Find the minimum/maximum delay and skew.
- (7) Evaluate the mean values of the maximum delay and clock skew, and their standard deviations.

To obtain stable estimates of the maximum clock delay and clock skew, all simulations are conducted one million times.
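The overall flow of steps (2)-(7) can be illustrated with a toy Monte Carlo sketch in Python. It is deliberately simplified and is not the simulator used in this paper: steps (3)-(5) are collapsed into a direct Gaussian perturbation of each branch delay, the tree is an abstract balanced binary tree, and the function name, branch statistics and trial count are all assumptions chosen for illustration (far fewer than one million trials).

```python
import random
import statistics


def simulate_cdn(levels, mean_branch, sd_branch, trials, seed=0):
    """Toy Monte Carlo over a balanced binary clock tree.

    Each branch delay is drawn independently from a Gaussian distribution;
    sibling paths share the delays of their common upstream branches.
    """
    rng = random.Random(seed)
    max_delays, skews = [], []
    for _ in range(trials):
        paths = [rng.gauss(mean_branch, sd_branch)]  # root branch
        for _ in range(levels):
            next_paths = []
            for delay in paths:
                # every node drives two downstream branches
                next_paths.append(delay + rng.gauss(mean_branch, sd_branch))
                next_paths.append(delay + rng.gauss(mean_branch, sd_branch))
            paths = next_paths
        max_delays.append(max(paths))          # step (6): maximum delay
        skews.append(max(paths) - min(paths))  # step (6): skew
    return {  # step (7): means and standard deviations
        "mean_max_delay": statistics.mean(max_delays),
        "sd_max_delay": statistics.stdev(max_delays),
        "mean_skew": statistics.mean(skews),
        "sd_skew": statistics.stdev(skews),
    }


# Hypothetical setting: 6 levels (64 sinks), 10 ps +/- 0.5 ps per branch.
print(simulate_cdn(levels=6, mean_branch=10.0, sd_branch=0.5, trials=2000))
```

A fixed seed keeps the run reproducible; in a real study the Gaussian draws would instead be mapped to physical dimensions and device parameters before the delay of each branch is computed.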

#### 5.4 Parasitics Extraction

Equations (13)-(17) indicate that we need to extract the parasitic parameters (resistance, capacitance and inductance) for the evaluation of interconnect delay.

The interconnect resistances  $R_{int}$  can be simply evaluated as  $R_{int} = r \cdot l/w$ , where r, l and w are the resistance of unit length interconnect, the interconnect length and interconnect width, respectively.

For the evaluation of capacitance, we adopt the quasi-3D on-chip capacitance model proposed in [16] to calculate the interconnect capacitance. The main idea of this capacitance model is to decompose a 3D wire structure into a series of 2D segments to achieve an efficient and accurate capacitance extraction for the 3D wire. The capacitance between crossover wires (even for non-orthogonal wires) can be calculated using Equation (18) with the concept of effective width $(W_{eff})$ proposed there:

$$C_{cross} = W_{eff}(90^{\circ}) \csc(\phi) C_{self} \tag{18}$$

where  $\phi$  is the rotation angle of crossover wires. The authors have shown that an excellent agreement exists between their results and that of the 3D capacitance solver.

Finally, we need to calculate the inductance used in the RLC delay model. Notice that it is usually very difficult to extract the accurate interconnect inductance, because the current return path is very complicated in a real chip. To make the evaluation of interconnect inductance tractable, we adopt here the formulas proposed in [15]. Thus, the loop inductance ($L_{loop}$) of the clock signal wire is given by Equation (19), depending upon the routing structure:

$$L_{loop} = L_{self\_clock} + L_{self\_power} - 2M_{clock\_power} \tag{19}$$

where $L_{self\_clock}$ and $L_{self\_power}$ are the self inductances of the clock and power wires, respectively, and $M_{clock\_power}$ is the mutual inductance between the clock and power wires. Both the self and mutual inductances can be calculated using the formulas proposed in [15].

#### 5.5 Parameters Setting

For the simulation of CDN-X and CDN-H used for global clock distribution to different processor elements (PEs), we assume the geometric parameters of CDN-X to be a=0.12mm and $b=10\mu m$, and the distance between PEs in CDN-H is set to $0.24mm^{\dagger}$. The input capacitance (load capacitance) of each PE is assumed to be 0.1pF [13], and the width of the power/ground wire is assumed to be the same as that of the clock signal wire. The pitch of the power line and ground line is set to $1.5\mu m$.

Our simulation is conducted based on a 70nm Cu CMOS technology under the Berkeley Predictive Technology Model (BPTM) [1]. We suppose that a 100X size inverter is inserted into each branch of the CDN, and the mean values and standard deviations of some key parameters are summarized in Table 1, where $t_{ox}$ is the gate oxide thickness, $\mu$ is the charge carrier mobility, $V_T$ is the threshold voltage, and the interconnection line has width w and thickness t on an oxide layer of thickness $f_{ox}$. The calculation of the output resistance $R_s$ of the inverter is the same as that of [13], and the evaluation of the parasitic capacitance $C_p$ is based on the equations proposed in [19].

Table 1 Mean values and standard deviations of major process parameters

| Parameter                 | Mean  | SD    |
|---------------------------|-------|-------|
| $V_{TN}$ (V)              | 0.2   | 0.01  |
| $V_{TP}$ (V)              | -0.22 | 0.011 |
| $t_{ox}$ (Å)              | 25    | 12.5  |
| $\mu_N$ ($cm^2/V\cdot s$) | 600   | 30.0  |
| t (nm)                    | 600   | 30    |
| w (nm)                    | 450   | 22    |
| $f_{ox}$ (nm)             | 800   | 40    |
| $\mu_P$ ($cm^2/V\cdot s$) | 140   | 7.0   |

The fluctuations of the geometry parameters of the inverter and wire are all assumed to follow normal distributions, and the magnitude of process variations is set to a conservative value of $15\%^{\dagger\dagger}$, i.e., the standard deviation of a parameter is 5% of its nominal value.
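The relation between the variation magnitude and the standard deviation stated above can be written as a short Python sketch. The helper names are hypothetical; the magnitude is interpreted as the $3\delta/\mu$ fraction defined in Section 5.1, and the nominal values in the example are taken from Table 1.

```python
import random


def sd_from_magnitude(nominal, magnitude=0.15):
    """Standard deviation implied by a variation magnitude of 3*sigma/mean."""
    return abs(nominal) * magnitude / 3.0


def sample_parameter(nominal, magnitude=0.15, rng=random):
    """Draw one normally distributed sample of a process parameter."""
    return rng.gauss(nominal, sd_from_magnitude(nominal, magnitude))


# Nominal values taken from Table 1 (units as listed there).
for name, nominal in {"V_TN": 0.2, "mu_N": 600.0, "t": 600.0, "f_ox": 800.0}.items():
    print(name, sd_from_magnitude(nominal))
```

Sampling each parameter this way corresponds to step (2) of the simulation flow, with every draw centered on the nominal value of the parameter.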

#### 6. Simulation Results

In this section, we first investigate the performance improvements of the Variant X-Tree CDN compared to the H-Tree CDN. We then present results that validate the proposed statistical performance analysis model.

#### 6.1 Performance Improvement of Variant X-Tree CDN

To study the performance improvement of the Variant X-Tree CDN compared to the H-Tree CDN in the presence of process variations, we simulated the mean values of the maximum/minimum clock delay and clock skew for both CDN-X and CDN-H with the RLC model adopted.

For the two types of CDN with different sizes (determined by the number of PEs), the mean values of the maximum clock delay and minimum clock delay are illustrated in Fig.8, and the corresponding results of clock skew are summarized in Fig.9.

Fig.8 and Fig.9 clearly indicate that both the clock delay and the clock skew of the Variant X-Tree CDN (CDN-X) are lower than those of the H-Tree CDN (CDN-H). For example, we can see from Fig.8 and Fig.9 that the mean values of the maximum delay and skew are 141.6ps and 12.2ps for CDN-X, while the corresponding values for CDN-H are 152.7ps and 13.6ps, respectively. Thus, for this example, the Variant X-Tree CDN yields a difference of about 7% in the maximum delay and about 12% in the clock skew.

Similarly, a difference of 8% in the maximum delay and an improvement of 17% in clock skew can be observed when the number of sinks reaches 1024 (level=10).

We attribute this to the shorter clock path length of the Variant X-Tree CDN compared with the H-Tree CDN, which confirms that adopting the X Architecture improves the performance of the CDN. On the other hand, the above figures also show that the clock delay and clock skew differ little between the two types of CDN when the number of sinks is small.

<sup>†</sup>Note that the distance between PEs in CDN-H is equal to the vertical distance between PEs in CDN-X.

<sup>††</sup>The magnitude of process variation (see Section 5.1 for its definition) is predicted to be as high as 60% in [17].



Fig. 8 The maximum and minimum delay of CDNs (Variant X-Tree CDN vs. H-Tree CDN)



Fig. 9 The clock skew of CDNs (Variant X-Tree CDN vs. H-Tree CDN)

#### 6.2 Statistical Performance Evaluation

To verify the proposed statistical performance analysis model of the Variant X-Tree CDN, we also ran simulation experiments on CDN-X. In particular, we simulated the mean values and standard deviations (SDs) of both the maximum clock delay and the clock skew, and compared the simulation results with the theoretical results calculated from the proposed statistical model. For efficient calculation, the variance with respect to the process variations was estimated with the method proposed in [4].

For CDN-X at different sizes (determined by the number of PEs), the mean values of the maximum clock delay and clock skew obtained from the proposed theoretical model and from the Monte Carlo simulations are summarized in Fig.10. Fig.11 gives the corresponding standard deviations of the maximum clock delay and clock skew.
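The Monte Carlo side of such a comparison can be sketched as follows. This is a toy illustration only: it assumes a binary clock tree in which every branch has the same nominal delay and an independent normal perturbation, whereas the simulations reported here use an RLC interconnect model with perturbed geometry parameters. All names and default values in the snippet are hypothetical.

```python
import random
import statistics


def simulate_tree_once(levels, nominal_branch_delay, rel_sigma, rng):
    """Return (max_delay, skew) for one random instance of a toy binary clock tree."""
    arrivals = [0.0]  # arrival time at the root
    for _ in range(levels):
        next_arrivals = []
        for t in arrivals:
            # Each node drives two branches with independently perturbed delays.
            for _ in range(2):
                sigma = rel_sigma * nominal_branch_delay
                next_arrivals.append(t + rng.gauss(nominal_branch_delay, sigma))
        arrivals = next_arrivals
    max_delay = max(arrivals)
    # Clock skew: maximum minus minimum delay over all clock paths.
    return max_delay, max_delay - min(arrivals)


def monte_carlo(levels, nominal_branch_delay=10.0, rel_sigma=0.05, trials=500, seed=0):
    """Estimate mean and SD of the maximum delay and skew over many random trees."""
    rng = random.Random(seed)
    samples = [
        simulate_tree_once(levels, nominal_branch_delay, rel_sigma, rng)
        for _ in range(trials)
    ]
    max_delays = [m for m, _ in samples]
    skews = [s for _, s in samples]
    return {
        "mean_max_delay": statistics.mean(max_delays),
        "sd_max_delay": statistics.stdev(max_delays),
        "mean_skew": statistics.mean(skews),
        "sd_skew": statistics.stdev(skews),
    }


if __name__ == "__main__":
    for levels in (2, 4, 6):
        print(levels, monte_carlo(levels))
```

The four returned statistics correspond to the quantities plotted against network size in Fig.10 and Fig.11; a closed-form model would be evaluated at the same sizes and compared entry by entry.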



Fig. 10 Mean values of the maximum clock delay and clock skew of CDN-X



Fig. 11 Standard deviation of the maximum clock delay and clock skew of CDN-X

The results in Fig.10 show that the simulation results and the theoretical results based on the proposed statistical model differ only slightly in both the maximum clock delay and the clock skew. For example, even for the large CDN-X network with level 10, the estimated maximum clock delays are 586.6ps and 632.7ps for the simulation and the theoretical model, respectively, while the corresponding clock skew estimates are 54.2ps and 57.8ps, so the estimation differences between simulation and theory are below 8% for both the maximum clock delay and the clock skew. These results indicate that the statistical model proposed in this paper is suitable for estimating the maximum clock delay and clock skew.

On the other hand, the standard deviations of the maximum clock delay and clock skew in Fig.11 show that, for the CDN-X network, the estimation differences in the standard deviations are considerably larger than those in the mean values. For example, the largest estimation difference in the SD of the maximum clock delay is 17%, in a level 9 CDN-X network, and the corresponding difference in the SD of the clock skew is 13%. Notably, however, the simulation results never exceed the theoretical results; that is, the theoretical model can at least be regarded as an upper bound.
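The two checks used in this comparison, the relative estimation difference and the upper-bound property, can be expressed in a few lines. The sketch below is illustrative: normalizing by the simulated value is an assumption (the text does not state the baseline), and the SD series in the example are made-up placeholders, since the values behind Fig.11 are not reproduced in the text.

```python
def relative_difference(simulated, theoretical):
    """Gap between model and simulation, normalized by the simulated value (assumed)."""
    return abs(theoretical - simulated) / simulated


def is_upper_bound(simulated_values, theoretical_values):
    """True if the model never predicts less than the simulation at any network size."""
    return all(t >= s for s, t in zip(simulated_values, theoretical_values))


if __name__ == "__main__":
    # Hypothetical SD series (ps) for increasing network sizes; not data from Fig.11.
    simulated_sd = [1.0, 2.0, 3.0]
    theoretical_sd = [1.1, 2.3, 3.5]
    gaps = [relative_difference(s, t) for s, t in zip(simulated_sd, theoretical_sd)]
    print("largest gap:", max(gaps))
    print("model is an upper bound:", is_upper_bound(simulated_sd, theoretical_sd))
```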

The results in this section show that, although the proposed statistical performance model yields larger estimation differences for the standard deviations, it remains effective and is applicable to the performance evaluation of Variant X-Tree CDNs, especially in the pre-design stage of a CDN.

### 7. Conclusions

In this paper, we presented a novel non-orthogonal CDN based on the X Architecture for on-chip wiring. We also conducted simulations to validate its performance when both process variations and inductance effects are taken into account. Our simulation results indicate that, compared with the traditional H-Tree CDN, the proposed CDN has the potential to improve the overall clock distribution performance in terms of maximal clock delay and clock skew.

We also studied the layout features of the Variant X-Tree in detail; the resulting rules make it possible to determine a properly sized Variant X-Tree clock distribution network. A statistical performance evaluation model was proposed as well. Simulation results show that the model is suitable for estimating the performance of a Variant X-Tree at the design stage and that, thanks to its closed form, it can be integrated easily into the design flow.

#### References

- [1] http://www.eas.asu.edu/~ptm/.
- [2] http://www.xinitiative.org/.
- [3] M. Afghahi and C. Svensson. Performance of synchronous and asynchronous schemes for VLSI systems. *IEEE Transactions on Computers*, 41(7):858–872, July 1992.
- [4] K. Agarwal, M. Agarwal, D. Sylvester, and D. Blaauw. Statistical interconnect metrics for physical-design optimization. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 25(7):1273–1288, July 2006.
- [5] C. S. Amin, N. Menezes, K. Killpack, F. Dartu, U. Choudhury, N. Hakim, and Y. I. Ismail. Statistical static timing analysis: how simple can we get? In *DAC '05: Proceedings of the 42nd annual conference on Design automation*, pages 652–657, San Diego, California, USA, 2005. ACM Press.
- [6] N. Arora, L. Song, S. Shah, K. Joshi, K. Thumaty, A. Fujimura, L. Yeh, and P. Yang. Interconnect characterization of X Architecture diagonal lines for VLSI design. *IEEE Transactions on Semiconductor Manufacturing*, 18(2):262–271, May 2005.
- [7] K. Bowman, S. Duvall, and J. Meindl. Impact of die-to-die and within-die parameter fluctuations on the maximum clock frequency distribution for gigascale integration. *IEEE Journal of Solid-State Circuits*, 37(2):183–190, February 2002.
- [8] H. Chang and S. S. Sapatnekar. Statistical timing analysis under spatial correlations. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 24(9):1467–1482, Sep. 2005.
- [9] S. G. Duvall. Statistical circuit modeling and optimization. In *The 5th Intl. Workshop on Statistical Metrology*, pages 56–63, 2000.
- [10] E. G. Friedman. Clock distribution networks in synchronous digital integrated circuits. *Proc. of IEEE*, 89(5):665–692, May 2001.
- [11] Y. Ismail and E. Friedman. Effects of inductance on the propagation delay and repeater insertion in VLSI circuits. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, 8(2):195–206, April 2000.
- [12] X. Jiang and S. Horiguchi. Optimization of wafer scale H-tree clock distribution network based on a new statistical skew model. In *DFT '00: Proceedings of the 15th IEEE International Symposium on Defect and Fault-Tolerance in VLSI Systems*, pages 96–104, Washington, DC, USA, 2000. IEEE Computer Society.
- [13] X. Jiang and S. Horiguchi. Statistical skew modeling for general clock distribution networks in presence of process variations. *IEEE Trans. Very Large Scale Integr. Syst.*, 9(5):704–717, 2001.
- [14] C.-K. Koh and P. H. Madden. Manhattan or non-Manhattan?: a study of alternative VLSI routing architectures. In *GLSVLSI '00: Proceedings of the 10th Great Lakes symposium on VLSI*, pages 47–52, New York, NY, USA, 2000. ACM Press.
- [15] X. Qi, G. Wang, Z. Yu, R. Dutton, T. Young, and N. Chang. On-chip inductance modeling and RLC extraction of VLSI interconnects for circuit simulation. In *Proceedings of the IEEE Custom Integrated Circuits Conference*, pages 487–490, 2000.
- [16] S.-P. Sim, S. Krishnan, D. Petranovic, N. Arora, K. Lee, and C. Yang. A unified RLC model for high-speed on-chip interconnects. *IEEE Transactions on Electron Devices*, 50(6):1501–1510, June 2003.
- [17] A. Srivastava, D. Sylvester, and D. Blaauw. Statistical Analysis and Optimization for VLSI: Timing and Power. Springer, 2005.
- [18] S. L. Teig. The X Architecture: not your father's diagonal wiring. In *SLIP '02: Proceedings of the 2002 international workshop on System-level interconnect prediction*, pages 33–37, New York, NY, USA, 2002. ACM Press.
- [19] N. H. Weste and D. Harris. CMOS VLSI Design, 3rd Edition. Addison Wesley, 2005.
- [20] X. Zhang, X. Jiang, and S. Horiguchi. A non-orthogonal clock distribution network and its performance evaluation in presence of process variations and inductive effects. In *GLSVLSI '06: Proceedings of the 16th ACM Great Lakes symposium on VLSI*, pages 336–340, Philadelphia, PA, USA, 2006. ACM Press.