Abstract-This paper focuses on low-power and low-slew clock network design and analysis for through-silicon via (TSV) based three-dimensional stacked ICs (3D ICs). First, we investigate the impact of the TSV count and the TSV resistance-capacitance (RC) parasitics on clock power consumption. Several techniques are introduced to reduce the clock power consumption and slew of the 3D clock distribution network. We analyze how these design factors affect the overall wire length, clock power, slew, and skew in 3D clock network design. Second, we develop a two-step 3D clock tree synthesis method: 1) 3D abstract tree generation based on the three-dimensional method of means and medians (3D-MMM) algorithm; 2) buffering and embedding based on the slew-aware deferred-merge buffering and embedding (sDMBE) algorithm. We also extend the 3D-MMM method (3D-MMM-ext) to determine the optimal number of TSVs to be used in the 3D clock tree so that the overall power consumption is minimized. Related SPICE simulation indicates that: 1) a 3D clock network that uses multiple TSVs significantly reduces the clock power compared with the single-TSV case, 2) as the TSV capacitance increases, the power savings of a multiple-TSV clock network decreases, and 3) our 3D-MMM-ext method finds a close-to-optimal design point in the "TSV count versus power consumption" tradeoff curve very efficiently.
I. INTRODUCTION

I
N three-dimensional integrated circuits (3D ICs), the clock distribution network spreads over the entire stack to distribute the clock signal to all the sequential elements. Clock skew, defined as the maximum difference in the clock signal arrival times from the clock source to all sinks, is required to be less than 3% or 4% of the clock period in an aggressive clock network design according to the International Technology Roadmap for Semiconductors (ITRS) projection [1] . Thus, clock skew control, which was well studied in 2D ICs [2] , is still a primary objective in the 3D clock network design. However, the clock signal in 3D ICs is distributed not only along the and directions, but also along the direction using through-silicon vias (TSVs). The clock distribution network drives large capacitive loads and switches at a high frequency. This leads to an increasingly large proportion of the total power dissipated in the clock distribution network. In some applications, the clock network itself is responsible for 25% [3] and even up to 50% [4] of the total chip power consumption. Moreover, clock slew must also be taken into consideration when designing a 3D clock network, because a large clock slew may cause a setup/hold time violation. Thus, low power, skew, and slew remain important design goals in 3D clock networks. TSVS provide the vertical interconnections to deliver the clock signal to all dies in the 3D stack. The TSV count is an important factor that characterizes the physical and electrical properties of the clock network. 3D integration with TSVs has been intensively studied in both chip-to-chip and chip-to-wafer communications [5] . The fabrication and characterization of TSVs are being explored in many companies and institutions [6] . TSV reliability issues are also studied [7] .
The low-power 3D clock network design demands a thorough investigation on how the TSV count and TSV parasitics affect the clock performance. Existing work has demonstrated that the total wire length of a 3D clock network decreases significantly if more TSVs are used [8] - [11] . According to the observations made in [8] , the die that contains the clock source includes a complete tree, while other dies can have subtrees, as illustrated in Fig. 1 . A 3D clock tree that utilizes multiple TSVs tends to reduce the overall wire length as more and more TSVs are used. However, the analysis of TSV resistance-capacitance (RC) parasitics on the clock network has not been addressed in the literature. If a 3D clock tree utilizes many TSVs that have large TSV RC parasitics, the clock delay and power consumption contributed by the TSVs may increase significantly. Using more TSVs helps to reduce the wire length and thus power consumption, but the TSV capacitance increases the clock power consumed at the same time. Our experiments indicate that the larger the TSV capacitance is, the faster the clock power consumption increases when more and more TSVs are used.
In this paper, we investigate the impact of various design parameters on the wire length, clock power, slew, and skew of the 3D clock network. These parameters include the TSV count 1 , TSV parasitics, the maximum loading capacitance of the clock buffers, and the supply voltage. We also develop clock network synthesis algorithms for low-power 3D clock network design. The contributions of this paper are as follows.
• We provide an extensive study on the impact of the TSV count and TSV parasitics on the clock power. We show the "TSV count versus power dissipation" tradeoff curves for various TSV parasitic values and discuss how the TSV count and the TSV capacitance together determine the overall clock power consumption.
• We discuss the impact of the TSV count and clock buffer insertion on clock slew control. Our study shows that using multiple TSVs helps to reduce the maximum and average slew as compared with the single-TSV case. In addition, specifying an upper bound of the load capacitance for each clock buffer remains an efficient way to control the maximum slew of the 3D clock network design.
• We present an effective way to determine the optimal number of TSVs for the 3D clock tree so that the overall power consumption is minimized. Our method predicts the impact of adding a new TSV into the current clock topology on the overall power consumption during the top-down abstract tree generation. This helps to decide whether pairing of two clock nodes in different dies and using a TSV for this pair is useful for power reduction or not. Related experiments indicate that our method finds a close-to-optimal design point in the TSV count versus power consumption tradeoff efficiently as compared with a straightforward exhaustive search method. The organization of this paper is as follows. We present an overview of related work on 3D clock network design in Section II. We formulate the problem of 3D clock tree synthesis in Section III. Section IV presents our 3D clock routing algorithm. We present an extension of our 3D-MMM algorithm in Section V. Section VI presents experimental results. We conclude the paper in Section VII.
II. RELATED WORK
In 3D clock tree design and optimization, TSV planning plays an important role in constructing a low-power 3D clock network. Minz et al. [8] proposed the first work on 3D clock routing algorithms. They discovered that the total wire length decreases significantly when more TSVs are used in the 3D clock network. They also studied the thermal impact on the 3D clock network, and proposed a thermal-aware 3D clock tree synthesis method to balance the clock skew caused by the thermal variations. Kim and Kim [11] presented a 3D embedding method to reduce wire length. However, they do not consider power consumption or slew rate and do not provide any SPICE simulation results. Zhao et al. [9] developed a clock design method to support the pre-bond testing for 3D ICs. They also discussed the impact of the TSV counts on the pre-bond testable clock tree. They observed that using multiple TSVs helps a pre-bond testable 3D clock network achieve low power consumption. However, this work did not take into account the impact of the TSV capacitance on the clock power.
Pavlidis et al. [12] presented measurement data on a fabricated 3D clock distribution network. Arunachalam and Burleson [13] proposed the use of a separate layer for the clock distribution network to reduce power. Their simulations show 15%-20% power reduction over the same 2D chip clock network. However, they focused on a simple H-tree and did not perform any design-level optimization.
Due to the significant dimension of the TSVs occupying the layout space [14] , the impact of TSV parasitics, especially the capacitance, should be taken into consideration for a low-power 3D clock network design. Existing work mainly focuses on 15 fF TSVs. But the TSV capacitance can vary from tens of femtofarads up to a few hundred femto-farads, depending on the material, TSV diameter, oxide thickness, and TSV height [15] , [16] . We observe that large TSV capacitance values significantly affect the existing discussions on the TSV count versus power tradeoff. In this case, the multiple-TSV insertion reduces the total wire length and power, but the large TSV parasitic capacitance increases the power consumption. As a result, the total clock power may increase in the multiple-TSV case. Therefore, a thorough study on the impact of both the TSV capacitance and the TSV count on the overall 3D clock power is required. Given the TSV parasitic impedance, a straightforward approach to find the optimal TSV usage for low-power design is to exhaustively search the entire range of the TSV count. This approach, however, requires prohibitive design time and is thus not practical.
III. PRELIMINARIES
A. Electrical and Physical Model of 3D Clock Network
In this paper, a 3D clock network is modeled as a distributed RC network. The sink nodes that represent flip-flops, and clock input pins of IP or memory blocks, are modeled as capacitive loads. Wire segments and TSVs are modeled with a model 2 , which is a classical way to represent the parasitics of a clock network. Each buffer or driver is constructed with two inverters. Note that prior work has focused on the electrical modeling of TSVs [15] - [18] . Our 3D clock routing algorithm is flexible to handle a more complicated TSV parasitic model if desired.
The TSV bound is defined as a user specified constraint on the maximum allowed TSV number per die. The TSV bound is usually decided before clock synthesis and is based upon the process technology. Different from the TSV bound, the TSV count (#TSVs) is the total number of TSVs utilized in the 3D clock tree. For an -die 3D stack, #TSVs is usually less than or equal to bound. A three-die clock interconnect using four TSVs is shown in Fig. 2 : the clock source is located in die-3; sink in die-1 connects to the source using two TSVs that are vertically aligned; sink in die-1 connects to the source by two TSVs; and sink in die-2 uses one TSV.
B. Problem Formulation
The 3D Clock Tree Synthesis problem can be formulated as following: Given a set of sinks in all dies, a TSV bound, a predetermined clock source location, and the parasitics of the wires, buffers, and TSVs, the 3D clock tree synthesis constructs a fully-connected 3D clock network such that 1) clock sinks in all dies are connected by a single tree, 2) the TSV count in each die is under the TSV bound, 3) the clock skew is minimized (and zero under the Elmore delay model [19] ), 4) clock slew is below the constraint, and 5) the wire length and clock power are minimized.
The clock skew is the maximum difference among the arrival times at the clock sinks. In the existing clock tree synthesis tools, the Elmore delay model is a popular measure of the RC delay and skew. The primary goal of our 3D clock tree synthesis is to guarantee a zero Elmore-skew clock network. In order to achieve more accurate timing information and to evaluate our clock synthesis performance, we use SPICE simulation on our 3D clock trees. The simulated clock skew is constrained to less than 3% of the clock period. The clock slew is defined as the transition time from 10% to 90% of the clock signal at each sink.
The TSV bound constraint plays an important role in achieving low-power 3D clock networks. It reflects the impact of TSV usage on routing congestion, capacitive coupling, stress-induced manufacturing issues, and so on. By varying the TSV bounds, we obtain different 3D clock networks with different qualities. Note that the TSV bound and the actual TSV usage in each die may be different, because the bound only puts the limit on the maximum TSV usage for each die.
IV. 3D CLOCK TREE SYNTHESIS
A. Overview
Our 3D clock tree synthesis algorithm consists of two major steps: 1) 3D abstract tree generation and 2) slew-aware buffering and embedding. First, we generate a 3D abstract tree based on our 3D method of means and medians (3D-MMM) algorithm. The 3D-MMM algorithm basically determines which pair of nodes (sink nodes or merging points) to connect together and utilizes TSVs if necessary, while building a binary tree in a top-down fashion. Note that our 3D-MMM algorithm works in such a way that there is always one die that contains a single tree which connects all sinks in the die, whereas the sinks in other dies are connected with multiple trees. In this case, the clock source is located in the die that contains the single tree.
Once a 3D abstract tree is obtained, we determine the routing topology and exact geometric locations for all the nodes, TSVs, and buffers. Our slew-aware deferred-merge buffering and embedding (sDMBE) method is a two-phase approach, which is based on the classic deferred-merge and embedding (DME) algorithm [20] in 2D clock routing. sDMBE first visits each node in a bottom-up fashion, determines the merging type for a pair of subtrees, inserts buffers if necessary, and calculates the merging distances based on the zero-Elmore-skew equations. The outcomes of sDMBE during the first phase are the merging segments, which store a collection of feasible locations of the internal nodes in the 3D abstract tree. During the second phase, sDMBE visits the whole abstract tree in a top-down manner while deciding the exact merging locations of the internal nodes, buffers, TSVs, and exact routing topology until all sinks are connected via a single tree.
B. 3D Abstract Tree Generation
The first step of our 3D clock tree synthesis is the 3D abstract tree generation using the 3D-MMM algorithm. A 3D abstract tree indicates the hierarchical connection information among the sink nodes, internal nodes, TSVs, and the root node. The 3D abstract tree of an -die stack clock network is an -colored binary tree, which is used to identify the die index for all the nodes.
We develop the 3D-MMM algorithm to generate a 3D abstract tree for the given clock sinks in a top-down manner, which is an extension of the method of means and medians (MMM) algorithm [21] . The 3D abstract trees generated by the 3D-MMM algorithm with various TSV bounds are shown in Fig. 3 . Note that a larger TSV bound tends to move TSVs closer to the sink nodes and causes more vertical clock connections than horizontal connections. However, the overall wire length is reduced due to the short horizontal connection length. The basic idea of our 3D-MMM algorithm is to recursively divide the given sink set into two subsets until each sink belongs to its own set. A TSV is used if we decide to merge a pair of nodes in different dies. In this case, our goal is to evenly distribute the TSVs across the die area and to satisfy the given TSV bound, which is shown to improve manufacturability [22] .
Let denote a set of sinks, where the locations of the sinks have been decided before the 3D clock tree synthesis. We assume that the maximum allowed TSV count for each die in (TSV bound) is also given. Each is a triplet of , where is the die index of , and and are the and coordinates of . Let denote the number of dies the sinks in set are located in. In each recursive partitioning, we divide the set into two subsets and based on the following two cases. • -: if the TSV bound is one, the given sink set is partitioned such that the sinks from the same die belong to the same subset. The connection between and needs one TSV in-between adjacent dies. Note that 3D-MMM is a bi-partitioning process. When the sinks of distribute to more than two dies (i.e., ), we need iterations of -direction partitions to split the sink set into subsets, so that the sinks belonging to the same die are in the same subset. Furthermore, the order of the -cut also depends on the source die index. Fig. 4 shows the details of the -cut procedure.
• -: if the TSV bound is larger than one, or the sinks in the set belong to the same die, the set is partitioned geometrically by a horizontal line ( -cut or -cut), and -dimension is ignored. If the subsets contain sinks from different dies, we potentially need multiple TSVs to connect those sinks. At the end of each partitioning, we propagate the TSV bound constraint by assigning a TSV bound for each new subset. The 3D abstract tree generation using the 3D-MMM algorithm is shown in Fig. 5 . The recursive method takes as inputs a set of 3D clock sinks and a TSV bound. If the size of the given sink set (i.e., ) is one, then we reach the bottom level of the abstract tree (lines [3] [4] . If the TSV bound is one, -cut is applied to partition the sink set into two subsets and (lines 6-7). As previously discussed, once the TSV bound is one, our 3D-MMM performs -direction partitions, so that the sinks belong to the same die are in the same subset. In order to guarantee that only one TSV is used between adjacent dies, the order of die-wise -cut depends on the source-die index and the die indices in the sink set , as illustrated in Fig. 4 . In the case that the above conditions are not satisfied, the set is partitioned geometrically by a horizontal line ( -cut or -cut), so called -cut (line 9). And the -dimension of each sink is ignored. The cut line is drawn at the median of the or coordinates of the sinks. The TSV bound is divided for the two subsets (line 10). The bound for each subset is calculated by Fig. 6 . Three-colored 3D abstract trees after applying Z-cut twice on the three-die stack sink set fa; b; cg, when the clock source is located in (b) die-3, (c) die-2, and (d) die-1. Each node in the abstract tree contains the corresponding sink set and a color index (b) first applies Z-cut and then Z-cut , whereas (d) applies Z-cut first and then Z-cut . 1) estimating the number of TSVs required by each subset and 2) dividing the given bound according to the ratio of the estimated TSVs. For each subset, we assume the minimum sink size in each die as the estimation of the number of TSVs. The procedure is called recursively for each of the subsets and with different TSV bounds (lines [11] [12] . The roots of the subtrees are connected by the root of the higher-level tree (lines [13] [14] [15] . The complexity of the algorithm is , where is the number of nodes. Corresponding to the -die stack clock sinks, the 3D abstract tree is an -colored binary tree, where each node (i.e., sinks, internal nodes, and the root) is assigned a color to represent which die it belongs to. The dies are numbered from 1 to from the bottom to the top. Let be the color index for node , where . For example, means that node is located in die-1. Let denote the die index, where the clock source is located. During the top-down 3D abstract tree generation, we color the nodes corresponding to the sink sets. Considering the node with the sink set , let and be the maximum and minimum die indices of the sinks within set . The color of is determined as follows:
if is the root else if else if otherwise.
(1)
Considering an edge with two terminal nodes and . The following are true: 1) if , the edge will be routed in the same die as node and ; 2) if , then number of TSVs will be inserted along the edge . Fig. 6 shows an illustration, where 3D abstract trees for a sink set are shown after applying -cut twice. Fig. 6(b)-(d) are the three abstract trees, where the clock source is located in die-3, die-2, and die-1, respectively. Each node in the abstract tree contains the sink set and color information. The abstract trees in Fig. 6(b) is obtained by -cut first and then -cut . Whereas, Fig. 6 (d) applies -cut first and then -cut . Fig. 6 (c) first extracts the sinks of the clock source die, and then applies a -cut. The primary goal of using a different -cut sequence is to guarantee that only one TSV is necessary between adjacent dies after -.
C. Slew-Aware Buffering and Embedding
The second step of our 3D clock tree synthesis is slew-aware buffering and embedding: Given a 3D abstract tree, the goal is to determine the exact geometric locations of all the nodes, TSVs, and buffers, such that the wire length of the embedded and buffered clock tree is minimized, the load capacitance of each buffer does not exceed the predefined maximum value (CMAX), and the clock skew is zero under the Elmore delay model. We develop the slew-aware deferred-merge buffering and embedding (sDMBE) algorithm to geometrically embed (route) the abstract tree.
sDMBE is a two-phase algorithm and is based on the deferred-merge embedding (DME) algorithm [20] , which has been widely used in 2D clock synthesis. The first phase in sDMBE is to determine the merging types and to construct the merging segments for each pair of subsets in a bottom-up traversal. Different from the existing 2D synthesis [23] - [25] , which focused on slew-aware buffer insertion after clock routing, sDMBE performs buffer insertion during the bottom-up procedure. The goal of slew-aware buffering in sDMBE is to locate buffers while merging subsets, so that the load capacitances of buffers are within the given bound (CMAX). The impact of CMAX on the 3D clock slew is discussed in Section VI-E. Merging segments are obtained based on the merging distances, which are computed under the zero-skew equations in the Elmore delay model and wire length minimization goals. The second phase of sDMBE is to decide the exact locations of internal nodes, buffers, and TSVs in a top-down fashion and determine the routing topology of the overall clock nets. The complexity of our approach is , which makes it feasible for incremental clock routing or inclusion in a solution search framework.
Two samples of merging segments for unbuffered and buffered 3D clock trees are shown in Fig. 7 . When merging child nodes and to parent node , sDMBE first decides the merging type based on the given 3D abstract tree and the CMAX constraint. Corresponding to the merging type among clock wires, buffers, and TSVs, we obtain the merging distances between nodes and , and in Fig. 7(a) , the distances between node and buffer , buffer and node , nodes , and in Fig. 7(b) .
V. EXTENSION OF 3D-MMM ALGORITHM
As illustrated earlier in Fig. 1 , the overall wire length of the 3D clock tree reduces as more and more TSVs are used. Fig. 8 provides another demonstration that higher usage of TSVs leads to shorter wire length. This raises an important question: what is the optimal number of TSVs for a 3D clock tree that leads to the minimum possible power consumption? One obvious way to answer this question is by trying all possible TSV counts and choosing the best power result (an exhaustive search). This method, however, is very time consuming and requires prohibitive runtime as shown in Table I . Thus, our goal is to find this TSV count that leads to the minimum (or close-to-minimum) power result in much shorter runtime. This calls for careful attention to the impact of the TSV count not only on the overall wire length but also the total number of buffers and total TSV capacitance as these factors equally affect the overall power consumption.
We develop our new low-power 3D clock tree synthesis method, named 3D-MMM-ext, by extending our 3D-MMM algorithm presented in Section IV-B. The goal of the 3D-MMM-ext is to construct a low-power clock network by wisely assigning clock TSVs during the 3D abstract tree generation. In each top-down partition, let be the current sink set. Let denote the vertical distance the set spans, which can be expressed as (2) where and are the maximum and minimum die indices of the sinks within set . Note that also indicates the minimum number of TSVs required by the clock network connecting all the sinks in . Different from the 3D-MMM algorithm, which decides the cut direction ( -cut or -cut) based on the TSV bound (lines 5 and 8 in Fig. 5 ), the key technique of 3D-MMM-ext is to determine the cutting orientation of Fig. 9 . The 3D-MMM-ext algorithm performed on a two-die stack with the sink set S. We show the 3D abstract trees, cut orders, and the subsets from case-1 and case-2 style partitions. (a) Case-1, where we apply Z-cut at the current iteration, and then X=Y -cut and X=Y -cut in die-1 and die-2, respectively. (b) Case-2, where we apply X=Y -cut at the current iteration, and then Z-cut and Z-cut . P and P are the cost of merging S and S in (a) and in (b), respectively. the current iteration (i.e., -cut or -cut) by looking ahead to the next cutting iteration, while estimating and comparing the costs of the following two cases.
• Case-1: apply -cut at the current iteration, and then apply -cut on each die once in the following iterations; • Case-2: apply -cut at the current iteration, and postpone -cut to the next iteration. Note that for the -die stack case, -cut means applying die-wise partitions in multiple iterations until the sinks having the same die index are partitioned into the same subset. In the case-1 style partition, the sink set has times -and times -. in the case-2 has one -cut and -. Let and represent the subsets after case-1 and case-2 style partitions, respectively. The sinks within the set (or ) are in the same die. Fig. 9 shows an example of determining the current cut direction using the 3D-MMM-ext on the sink set . Fig. 9(a) shows the case-1 style partition, where -cut is applied during the current iteration and then -cut and -cut are applied on die-1 and die-2, respectively. Fig. 9(b) illustrates the case-2 partition results. We also show a part of the 3D abstract tree corresponding to case-1 and case-2 partitions, respectively. We have the following relation:
By comparing the cost of case-1 and the cost of case-2 , the cut direction of the current iteration is determined as follows:
This means that if selecting -cut during the current iteration helps reduce power, then we choose -cut; otherwise, we choose -cut. The cost is defined as follows:
Let represent either or . The first item in the cost function is the cost of the subset , where covers the final subsets after the look-ahead partitions. The second item in the cost function is the cost of connecting subsets and .
mainly comes from TSVs, global wires, and buffers. Therefore, covers all pairs of subtrees in the 3D abstract tree, where we merge those final subsets to their parent sink set during the bottom-up traversal.
Considering the two-die stack examples in Fig. 9 , and can be expressed as follows:
To estimate the cost for each sink set, we use the half-parameter wire length model for and . Then, is estimated as follows.
• If no TSV is required to connect and (9) where is the distance between the centers of subsets and . In Fig. 9 , , , and belong to this case.
• If TSVs are needed to provide inter-die connection between and : (10) where is the TSV capacitance, is the unit-length capacitance of the clock line, and is an estimator representing the cost of TSV insertion. We use the following empirical equation to calculate : (11) where , 0.05 and 0.1 if the TSV capacitance is 15 fF, 50 fF, and 100 fF, respectively. In Fig. 9 , , , and belong to this case.
VI. SIMULATIONS AND DISCUSSIONS
We first examine a two-die stack to investigate the impact of the TSV count and TSV parasitics on clock power consumption.
Next, we show the efficiency of the 3D-MMM-ext algorithm in finding the optimal number of TSVs to be used for minimum power consumption. We then present the results of our clock slew control method. Lastly, we show the impact of scaling the supply voltage on 3D clock power consumption. We validate our claims with SPICE simulation results.
A. Simulation Settings
We construct a zero-Elmore-skew 3D clock network by using the 3D clock tree synthesis methods developed in Sections IV and V. We then extract the netlist of the entire 3D clock network for SPICE simulation. After the simulation, we obtain highly accurate power consumption and timing information of the entire clock network. Note that our 3D clock tree has zero skew under the Elmore delay model, but may have nonzero clock skew from SPICE simulation. Thus, we constrain the SPICE clock skew to be less than 3% of the clock period at a frequency of 1 GHz. The slew is constrained within 10% of the clock period. Clock power mainly comes from the switching capacitance of the interconnect, sink nodes, TSVs, and clock buffers.
The technical parameters are based on the 45 nm Predictive Technology Model [26] : per unit-length wire resistance is 0.1 , and per unit-length wire capacitance is 0.2 fF/um. The buffer parameters are: driving resistance is 122 , input capacitance is 24 fF, and intrinsic delay is 17 ps. The TSV resistance is 35 . In order to study the impact of the TSV RC parasitics on the 3D clock network, we vary the linear oxide thickness and choose three typical TSV capacitance values (i.e., 15 fF, 50 fF, 100 fF). The supply voltage is set to 1.2 V unless otherwise specified. The maximum load capacitance of each clock buffer, denoted CMAX, is set to 300 fF for slew control unless otherwise specified.
Our analysis focuses on two-die and six-die 3D clock networks. In the six-die case, the clock source is located in the middle die (die-3) as suggested in [10] , unless otherwise specified. As a result, die-3 in a six-die clock network contains a complete tree. The IBM benchmarks to [27] are used. Since to are originally designed for 2D ICs, we randomly distribute the sinks into two or six dies. We then scale the footprint area by to reflect the area reduction in the 3D design. Sample clock trees in die-1 and die-3 of a six-die 3D clock network are shown in Fig. 10 . The triangle denotes the clock source in die-3. Each die contains up to 20 TSVs. Note that die-3 has a single global tree that connects all the sinks, and die-1 contains multiple local trees that are connected to the clock source using TSVs.
B. Impact of TSV Count and Parasitic Capacitance
To investigate the impact of the TSVs on clock power consumption, we use a two-die stack implementation of the biggest benchmark , which has 3101 sink nodes with input capacitances varying from 30 to 80 fF. Fig. 11 shows three clock power trend curves for a TSV capacitance of 15 fF, 50 fF, 100 fF, respectively. On the -axis we show the total number of TSVs used in each entire 3D clock tree, which is obtained Fig. 10 . Clock trees in die-1 and die-3 of a sample six-die 3D clock network, where the clock source is located in die-3. Black dots denote TSVs. The TSV bound is set to 20. Die-1 contains many local trees, whereas die-3 contains a single global tree. Fig. 11 . Impact of the TSV capacitance and count on clock power for the two-die r5. The TSV capacitance (C ) is set to 15 fF, 50 fF, and 100 fF.
Our baseline is the clock tree that uses one TSV between adjacent dies. For each C , we show the 3D-MMM results by sweeping the TSV count. We also highlight the 3D-MMM-ext results for each C , which are marked as stars near to the trends.
by imposing a different TSV bound. Our baseline 3D clock network contains only one TSV between adjacent dies. The clock power is affected by both the TSV count and the TSV capacitance as shown in Fig. 11 . First, using 15 fF TSVs in the clock network construction, the clock power decreases significantly when more TSVs are used. We are able to obtain a low-power clock network design by relaxing the TSV bound. We can achieve up to 17.0% power reduction as compared to the single-TSV case. The power savings mostly comes from wire length reduction, because the clock wire capacitance significantly affects the overall power consumed by the clock network. When more TSVs are used, the number of local trees in the non-source dies increases while their size decreases. This means that the multiple-TSV case encourages more local clock distribution in 3D designs while reducing the overall wire length. Second, if the TSV has a large capacitance (e.g., 50 fF, 100 fF), the contribution of the TSV capacitance to the overall power consumption is non-negligible. As a result, when the TSV count increases, the overall clock power reduction becomes slower. Particularly, if the TSV capacitance is 100 fF, clock power does not decrease when the TSV count exceeds a certain amount and eventually starts increasing. In this case, the clock power from the TSV capacitance increases faster than the power decreases from wire length reduction.
From this trend study, we conclude that given a TSV parasitic capacitance, there exists an optimum number of TSVs that results in the minimum 3D clock power. This in turn allows us to choose the right TSV bound for a given power budget. If a power savings of 10% is required for using the 15 fF TSVs, the TSV bound of 300 can be used based on point in Fig. 11 .
C. Exhaustive Search Results
A straightforward way to find the "min-power TSV count," i.e., the number of TSVs used in a 3D clock tree that leads to the minimum overall clock power consumption, is to exhaustively sweep the TSV bound from 1 to infinity 3 , constructing and simulating the entire 3D clock network corresponding to each TSV bound. By plotting the TSV count versus power trend curve, we are then able to find the optimum solution. Fig. 12 shows the clock power trend based on 1137 3D clock trees we generated and simulated for the two-die stack . We assume the TSV parasitic capacitance is 100 fF.
We observe that the lowest power comes from the clock network that uses 250 TSVs, with 1.190 W clock power and 2 004 250 wire length. In addition, we observe that the exhaustive search result agrees with the TSV count versus power trend we presented in the previous section, although power fluctuates locally in a small range of the TSV count. If the TSV count exceeds 600, the clock power is much more sensitive to the TSV count increase. Using one more TSV may lead to the clock power increasing or decreasing by 1%. This is because, when using a large amount of TSVs, the clock network has a large number of smaller local trees. This means that the TSV capacitance itself is comparable to or even larger than that of a single local clock tree. In this case, using a few more TSVs leads to a large fluctuation in clock power.
The proposed exhaustive search method does allow us to find the min-power TSV count, but it is too costly in terms of runtime. The smaller step size we use for the TSV count in the search, the lower power of a 3D clock network we find, but more simulations as well as runtime are required. Note that the typical SPICE simulation time of a two-die clock network is around 200 s. Repeating this 1137 times is prohibitive.
D. 3D-MMM-ext Algorithm Results
In Fig. 12 , the star indicates the solution obtained by our 3D-MMM-ext algorithm. Our algorithm does not involve any exhaustive search on the TSV count, but relies on our look-ahead based method to control the TSV usage and to minimize the overall power consumption. We observe that our 3D-MMM-ext generates a 3D clock tree that has a similar quality as the one obtained by the exhaustive research, but at a fraction of runtime. The runtime required for 3D-MMM-ext is comparable to that of generating a single 3D clock tree. The solution quality obtained by our 3D-MMM-ext algorithm can also be seen in Fig. 11 , where the stars indicate the 3D trees produced by 3D-MMM-ext. The power consumption at these points is comparable to the minimum power solutions found in each curve. Table I presents more detailed comparisons of wire length , buffer count (#Bufs), clock power (W), clock skew (ps), number of simulations (#sims), and the total simulation runtime (s) between the exhaustive search and the 3D-MMM-ext algorithm. We use two-die 3D stacks. We also show the wire length and power reduction of 3D-MMM-ext with respect to the exhaustive search. First, the clock power of 3D-MMM-ext is comparable to that of the exhaustive search. In most cases, 3D-MMM-ext has less than 1% power difference. In some cases, 3D-MMM-ext achieves even lower power (i.e., positive reduction) than the exhaustive search. This is mainly because the low-power design obtained by the exhaustive search depends on the sweeping granularity and simulation times. Second, the simulation runtime comparison reveals the effectiveness of our 3D-MMM-ext algorithm. 3D-MMM-ext requires only a single simulation, whereas the exhaustive search requires 29-41 simulations.
Tables II and III list the comparison between using a single TSV and using multiple TSVs (obtained with 3D-MMM-ext algorithm) cases. We use a two-die and a six-die implementation of our benchmark designs. First, the 3D-MMM-ext is able to find the low-power 3D clock trees. For the two-die stacks in Table II , the 3D-MMM-ext reduces the clock power by around 16.1%-18.8%, 10.3%-13.7%, and 6.6%-8.3% as compared with the single-TSV cases, and achieves wire length savings around 24.0%-26.5%, 23.9%-26.6%, and 16.6%-18.9%, when the TSV capacitance is 15 fF, 50 fF, and 100 fF, respectively. In the case of six-die stacks shown in Table III , our 3D-MMM-ext reduces power by up to 36.1%, 26.4%, and 9.1%, and reduces wire length by up to 50.7%, 47.4%, and 17.3%. Table IV lists the comparisons between placing the clock source in die-1 and in die-3, for six-die stacks using the 3D-MMM-ext algorithm. When moving the clock source to the -3) , the 3D-MMM-ext achieves further power savings, especially in the case when the TSV capacitance is 100 fF. In addition, in most of the cases, e.g., the six-die stacks with 15 fF and 50 fF TSVs, the middle-die 3D-MMM-ext uses fewer TSVs and achieves lower power than the cases when the src is in die-1.
In most cases, the simulated clock skew is less than 20 ps, which is less than the 30 ps constraint. In the case of the six-die 3D stack of , Fig. 13 shows the spatial distribution of the propagation delay for the die containing the clock source. The TSV count is 3497. We observe that the clock skew among the six dies varies within . The skew of the entire 3D clock network is 21.4 ps. Referring to the TSV RC parasitics and the 300 fF CMAX constraint, the delay along each TSV is in the order of 0.01 ps. Compared with the src-to-sink delay, this means that the TSV itself contributes a negligible portion of delay to the entire src-to-sink delay. Note that our 3D clock tree synthesis algorithm builds a zero-skew tree under the Elmore delay model, which in practice shows discrepancy between SPICE simulation results.
E. Low-Slew 3D Clock Routing
Our goal in this experiment is to show that the TSV count also affects the clock slew distribution. Fig. 14 shows the slew distribution of the six-die 3D clock tree for among all sinks. The clock slew constraint is set to 100 ps, which is 10% of the clock period. The slew distribution of the single-TSV clock tree is shown in Fig. 14(a) , whereas Fig. 14(b) shows the slew distribution of the multiple TSV clock tree using the 3D-MMM-ext. In the single-TSV clock tree, slew varies within with an average slew of 53.9 ps. The slew distribution of the multiple-TSV case is in the range of with an average slew of 46.8 ps. Compared with the single-TSV case, the multiple-TSV case reduces the maximum slew and average slew by 2.4 and 7.1 ps, respectively. The main reason for the improved slew distribution of the multiple-TSV 3D tree is the shorter wire length, which in turn reduces the capacitive load. Thus, we conclude that multiple TSVs are effective in improving the slew distribution.
The impact of CMAX, the maximum clock buffer load capacitance, on slew variations (min, average, max) and power consumption in the single-TSV and multiple-TSV clock trees is shown in Fig. 15 . First, CMAX remains an efficient means to control the maximum slew in 3D clock network design. Both the single-TSV and multiple-TSV cases have similar trends as CMAX varies from 300 to 175 fF: a smaller CMAX reduces the maximum slew, but increases the clock power. This is because each buffer stage is allowed to drive a smaller capacitance with smaller CMAX, which in turn requires more buffers and thus consumes more power. Second, given a certain CMAX, multiple-TSV clock trees always have reduced maximum slew and less average slew, as compared with the single-TSV cases. Third, we note that the multiple-TSV case always consumes less power than the single-TSV case. Therefore, we conclude that the multiple-TSV case achieves both low power and better slew results.
F. Scaling the Supply Voltage
In this section, we investigate the impact of supply voltage scaling on 3D clock power, clock skew, and slew. The clock skew and power changes when the supply voltage is scaled down from 1.2 to 0.7 V. These changes are shown in Fig. 16 , for a clock frequency of 1 GHz. We first compare the two clock networks based on 15 fF and 100 fF TSV capacitance. Both of the clock networks use 125 TSVs. We first observe that both clock networks have a similar trend when the supply voltage is scaled down: The clock power is reduced from around 1.2 W to 0.4 W, which is more than a 65% power reduction. Second, the clock skew increases from 20 to 80 ps if the TSV capacitance is 15 fF, and from 20 to 120 ps for 100 fF TSVs. Moreover, the clock skew for a 100 fF TSV capacitance increases faster than that for a 15 fF TSV capacitance. This is mainly because the former uses 2830 clock buffers, whereas the latter uses 2789 clock buffers. The more buffers a 3D clock tree contains, the faster the clock skew degrades with the supply voltage scaling down. Thus, if the maximum simulated clock skew is set to 40 ps, the clock network can normally operate above 0.8 V and 0.9 V, using 15 fF TSVs and 100 fF TSVs, respectively.
The impact of scaling the supply voltage on the clock slew distribution and power changes is shown in Fig. 17 . The supply voltage is scaled down from 1.2 to 0.7 V, and the clock frequency is kept at 1 GHz. We compare two clock networks: the first uses TABLE V COMPARISONS WITH [11] 125 TSVs, and the second uses 4782 TSVs. Both clock networks are based on 15 fF TSVs. We find that the clock network using 4782 TSVs always has better control on slew distribution, regardless of the supply voltage value. In addition, the clock tree using 4782 TSVs consumes lower power than the tree using 125 TSVs for all the voltage levels. As discussed earlier, this is due to the faster reduction in capacitance from shorter wire length and fewer buffers than the TSV capacitance increase by using more TSVs. When the supply voltage scales down, the power difference between these two clock networks is reduced.
G. Comparison With Existing Work
We show the comparison of our work with [11] in Table V. Note that [11] does not support buffer insertion or provide any SPICE simulation results. However, we attempted a comparison with [11] by disabling our buffer insertion. We use the same benchmark settings and report the skew and delay values in the Elmore delay model. We observe that our method uses 21.3%-33.7% fewer TSVs than [11] while using 5.2%-5.8% more wire length. Note that in our work we can control the TSV count versus wire length tradeoff by tweaking the TSV bound. In addition, these results come from unbuffered clock trees. Our sDMBE algorithm supports buffer insertion, which helps to properly control wire snaking and therefore better minimizes the wire length.
VII. CONCLUSION
In this paper, we explored design optimization techniques for reliable low-power and low-slew 3D clock network design. We thoroughly studied the impact of the TSV count and the TSV capacitance on clock power trends. We observed that using more TSVs helps reduce the wire length and power consumption; and shows better control over clock slew variations. However, in the case of a large TSV parasitic capacitance, clock power could increase if too many TSVs are used. We also observed that a smaller maximum loading capacitance on the clock buffers efficiently lowers the 3D clock slew. Furthermore, we developed a low-power 3D clock tree synthesis algorithm called 3D-MMMext. Experimental results show that our 3D-MMM-ext algorithm constructs low-power 3D clock designs that have comparable power and reliability to an exhaustive search with a few orders of magnitude shorter runtime.
