In this paper, we propose a new approach for gated bus synthesis [16] with minimum wire capacitance per transaction in three-dimensional (3D) ICs. 3D IC technology connects different device layers with through-silicon vias (TSVs), which must be treated differently from metal wires because of reliability issues and their larger footprint. In practice, the number of TSVs between layers is bounded; thus, we first devise dynamic programming and local search techniques to determine the optimal TSV locations. We then employ two approximation algorithms to generate a rectilinear shortest-path Steiner graph in each device layer. One algorithm extends the well-known greedy heuristic for the Rectilinear Steiner Arborescence problem and handles large cases with high efficiency. The other algorithm uses a linear programming relaxation and rounding technique, which takes more time but generates a nearly optimal Steiner graph. Experimental results show that our algorithms can construct shortest-path Steiner graphs with up to 22% less total wire length than the previous method of Wang et al. [16].
INTRODUCTION
Three-dimensional (3D) IC technology offers the potential for improving performance and power consumption for bus architectures in SoC design [4]. 3D IC technology uses through-silicon vias (TSVs) to connect device layers. In SoC designs, TSVs can potentially reduce global interconnect among IPs, improving communication power efficiency and performance of bus-based architectures. For example, Pathak et al. [12] show that the timing of a 3D LEON3 multi-core design with AMBA bus architecture connected by TSVs is better than that of a 2D design by about 79%.
ISPD '12, March 25-28, 2012, Napa, California, USA. Copyright 2012 ACM 978-1-4503-1167-0/12/03 ...$10.00.
A 3D IC design needs many TSVs for the data bus, address bus and control wires that connect IPs on different layers. However, the number of TSVs must be limited, since TSVs are costly to fabricate and a large number of TSVs may degrade manufacturing yield, test cost efficiency and available on-chip layout area. Given this consideration, designers cannot use TSVs generously for communication between device layers. Therefore, the selection of TSV positions becomes important when designers seek to increase performance and reduce power consumption of bus architectures while using a restricted number of TSVs.
Many researchers [5] [11] [16] have worked on low-power bus architectures, reducing unnecessary wire loading or signal switching when the bus transfers data. Bus segmentation [5] and bus splitting [11] methods reduce wire load by masking off certain bus segments, but only consider where to mask off within a given bus topology to achieve maximum power savings. In the gated bus architecture [16] , Wang et al. synthesize a gated bus topology to reduce the wire capacitance by using distributed multiplexers and demultiplexers.
Figure 1(a) shows an implementation of the gated bus architecture with two masters s_1, s_2 and four slaves t_1, ..., t_4. The gray multiplexers and demultiplexers mask off the unused paths when transferring data from s_2 to t_1, so that s_2 only needs to drive the path consisting of the three solid arrows. Therefore, the driven wire capacitance is greatly reduced. To have the minimum wire capacitance for every data transaction, the Steiner graph must contain, for each master-slave pair, a shortest path with length equal to the Manhattan distance between the pair. We call such a Steiner graph a shortest-path Steiner graph; our objective is to find a shortest-path Steiner graph with smallest total wire length. In [16], Wang et al. proposed a heuristic method that first constructs a minimum rectilinear Steiner arborescence [8] [13] starting from a master, and then adds the other masters one by one in each iteration to obtain a shortest-path Steiner graph.
In this paper, we investigate the problem of gated bus synthesis in 3D ICs to minimize total power consumption of the gated bus architecture. Since the constraints on TSVs are different from those on on-die vias, e.g., larger footprint and keep-out zones, we must consider TSVs separately in this problem. We first develop dynamic programming and local search algorithms to determine TSV locations that minimize the sum of weighted shortest distances over all master-slave pairs in a 3D IC design. Then, given the TSV locations thus determined, we propose two approximation algorithms to synthesize a shortest-path Steiner graph with smallest total wire length on each layer of the 3D IC stack; this graph determines the locations of multiplexers and demultiplexers for the gated bus architecture. The two algorithms are a greedy heuristic that can handle large instances and a linear programming relaxation and rounding method [15] that is more accurate and suited to relatively small instances. Overall, our method can reduce wire length by up to 22% when compared to [16].
The remainder of this paper is organized as follows. In Section 2, we introduce the problem of gated bus synthesis in 3D ICs and formally state two problem formulations on TSV location determination and construction of shortest-path Steiner graphs. In Section 3, we give two algorithms for determination of TSV locations; these address the cases of one TSV and multiple TSVs between adjacent layers, respectively. Section 4 describes our approximation algorithms to generate a shortest-path Steiner graph with minimum total wire length. Experimental results are shown in Section 5, and conclusions are given in Section 6.
PROBLEM FORMULATION
We assume that we are given the locations of masters and slaves on device layers, and the communication frequency for each master-slave pair. Our goal is to minimize total power consumption over all master-slave pairs while keeping the total wire length as small as possible. Total power is estimated as the sum, over all master-slave pairs, of frequency multiplied by capacitance, i.e., only dynamic power is considered. Note that since the auxiliary gates introduced in our bus architecture account for only a small percentage of the total gate count in a design, the increase in leakage power is negligible (ignoring possible effects of power-gating).
As noted above, we cannot place an arbitrary number of TSVs between adjacent device layers. Hence, we assume that there is a given constant B indicating the maximum number of TSVs available for connecting masters and slaves between two device layers. Our experimental results (Section 5) show that when B > 2, this restriction still achieves nearly the optimal power consumption obtained when the number of TSVs between layers is unbounded (B = ∞). In addition, with so few TSVs, the power impact of the large TSV capacitance is insignificant. The overhead of the control wires in the gated bus architecture is small compared to the data bus and the address bus because the number of such wires is much smaller than the bus width. Moreover, since the control signals do not switch during a data transaction, the dynamic power associated with control signals is small. We therefore ignore the effects of control wiring in this study.
• Problem One: Find locations of TSVs between adjacent layers so that the total length of weighted shortest paths between master-slave pairs is minimized. We take frequencies to be weights, and at most B TSV locations can be assigned between two adjacent layers.
• Problem Two: Given optimal (fixed, from Problem One) locations of TSVs, construct a rectilinear shortest-path Steiner graph in each layer that minimizes the total wire length subject to the existence of a layer-wise rectilinear shortest path for each master-slave pair. This is the minimal rectilinear shortest-path Steiner graph (RSSG) problem, which was first considered in [16].
PROBLEM ONE: DETERMINATION OF TSV LOCATIONS
The input of our problem includes the locations of n masters {s_1, s_2, ..., s_n} and m slaves {t_1, t_2, ..., t_m} on L device layers from M_1 to M_L, and the communication frequency c_{i,j} for each master-slave pair (s_i, t_j). We assume that all device layers have the same size W × H. Given the upper bound B on the number of TSVs between adjacent device layers, our first objective is to determine the best locations of TSVs so that the sum of weighted shortest-path lengths over all s_i-t_j pairs, for 1 ≤ i ≤ n and 1 ≤ j ≤ m, is as small as possible. The weight for the shortest path between s_i and t_j is chosen to be c_{i,j}. For example, Figure 3 shows a possible assignment of two sets of TSVs (in blue) between three device layers, where the shortest path through TSVs from s_1 to t_1 is denoted by red arrows. The length of TSVs is neglected in the following discussion since it does not affect the objective value. We denote the planar coordinates and layers of s_i and t_j for 1 ≤ i ≤ n and 1 ≤ j ≤ m by (x [...] together. By induction, there exists an optimal solution in which all TSVs are located on the Hanan grid.
As noted in [9], restriction to Hanan grids may be suboptimal for minimizing the maximum delay between source-terminal pairs. We assume that timing constraints can still be met by adding buffers in buses. In the following two subsections, we present an efficient dynamic programming algorithm to determine the optimal TSV location assignment when B = 1 and a local search heuristic to determine an approximately optimal TSV location assignment when B > 1.
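The restriction to the Hanan grid can be illustrated with a short Python sketch. It assumes that the grid is induced by the planar coordinates of all masters and slaves projected onto one plane; the function name `hanan_grid` and the sample coordinates are hypothetical and chosen only for illustration.

```python
from itertools import product


def hanan_grid(points):
    """Return every intersection of the horizontal and vertical lines through `points`."""
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    return [(x, y) for x, y in product(xs, ys)]


# Hypothetical planar positions of masters and slaves (all layers projected together).
terminals = [(0, 0), (4, 1), (2, 5)]
candidates = hanan_grid(terminals)
print(len(candidates), "candidate TSV sites")
```

With k distinct x coordinates and k distinct y coordinates there are at most k × k candidate sites, which is what makes enumeration over TSV positions finite.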
One TSV between Adjacent Layers
In the case where B = 1, we can determine the x and y coordinates of TSVs separately, by the coordinate-wise additivity of rectilinear distance. Here, we give the algorithm for finding the x coordinates of TSVs. We again represent an assignment of TSV locations by
a vector whose k-th entry denotes the x coordinate of the single TSV between M_k and M_{k+1}. Now our objective can be expressed as
where d(s_i, t_j) is the length of a shortest path from s_i to t_j; w_{s_i}, w_{t_j} and w_{v_k} indicate constant weights corresponding to each term; and χ is a 0-1 indicator function. We explain the meaning of each term of (1) as follows.
• The segment between the two TSVs connecting M_{k+1} with M_{k+2} and M_k with M_{k+1}, respectively.
• The segment between s_i and the TSV connecting the layer of s_i with the layer above.
• The segment between s_i and the TSV connecting the layer of s_i with the layer below.
• The segment between t_j and the TSV connecting the layer of t_j with the layer above.
• The segment between t_j and the TSV connecting the layer of t_j with the layer below.
• The segment between s_i and t_j if they are on the same layer; the indicator χ selects this case.
If the shortest path from s_i to t_j passes through a segment listed above, it contributes c_{i,j} to the corresponding weight constant. Figure 4 gives an example where the shortest path between s_3 and t_2 passes through four segments whose corresponding weight constants are labeled below. Therefore, the shortest path will contribute c_{3,2} to w [...] and let x(r) be the r-th coordinate of x for 1 ≤ r ≤ n+m. Let S(k) and T(k) respectively denote the sets of masters and slaves in layer M_k, for 1 ≤ k ≤ L. To derive our dynamic programming algorithm, we use a function OPT to indicate partial solutions, defined as
The initial values of function OPT can be evaluated as
The value of OPT(k_c, r_c) indicates the minimal total length of weighted shortest paths between masters and slaves in layers {M_1, ..., M_{k_c}} plus weighted shortest paths from masters/slaves in {M_1, ..., M_{k_c}} to the TSV with position x(r_c) between M_{k_c} and M_{k_c+1}. The key step for computing OPT is expressed by the following recursive formula:
Intuitively, since all communications from masters/slaves in {M_1, ..., M_{k_c-1}} to masters/slaves in M_{k_c} and to the TSV between M_{k_c} and M_{k_c+1} must go through the TSV between M_{k_c-1} and M_{k_c}, we only need to enumerate all possible positions x(r) of the TSV between M_{k_c-1} and M_{k_c} and choose the best one to achieve OPT(k_c, r_c). Finally, our objective (1) will be computed as
Theorem 2. The time complexity of the dynamic programming algorithm for determination of TSV locations is 
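The layer-by-layer idea behind this dynamic program can be sketched in Python for the x coordinate alone. This is a minimal illustration under stated assumptions, not the exact recurrence above: layers are numbered from 0, `tsv_x[k]` is the TSV between layer k and layer k+1, candidate positions are the terminals' own x coordinates, and the names `pair_length`, `objective` and `place_tsvs_1d` are hypothetical.

```python
def pair_length(a, b, tsv_x):
    """1-D path length between terminals a=(x, layer) and b=(x, layer) through the TSVs."""
    (xa, la), (xb, lb) = sorted((a, b), key=lambda p: p[1])
    if la == lb:
        return abs(xa - xb)
    hops = sum(abs(tsv_x[k + 1] - tsv_x[k]) for k in range(la, lb - 1))
    return abs(xa - tsv_x[la]) + hops + abs(tsv_x[lb - 1] - xb)


def objective(masters, slaves, freq, tsv_x):
    """Sum of frequency-weighted path lengths over all master-slave pairs."""
    return sum(freq[i][j] * pair_length(s, t, tsv_x)
               for i, s in enumerate(masters) for j, t in enumerate(slaves))


def place_tsvs_1d(masters, slaves, freq, num_layers):
    """Choose one TSV x coordinate per adjacent layer pair by a chain DP."""
    cand = sorted({x for x, _ in masters + slaves})
    n_tsv = num_layers - 1
    if n_tsv < 1:
        return []
    # node[k][r]: weighted terminal-to-TSV distance if TSV k sits at cand[r].
    node = [[0] * len(cand) for _ in range(n_tsv)]
    # link[k]: total frequency of pairs whose path uses both TSV k and TSV k+1.
    link = [0] * (n_tsv - 1)
    for i, s in enumerate(masters):
        for j, t in enumerate(slaves):
            (xa, la), (xb, lb) = sorted((s, t), key=lambda p: p[1])
            if la == lb:
                continue  # same-layer pairs do not depend on TSV positions
            for r, x in enumerate(cand):
                node[la][r] += freq[i][j] * abs(xa - x)
                node[lb - 1][r] += freq[i][j] * abs(xb - x)
            for k in range(la, lb - 1):
                link[k] += freq[i][j]
    opt, back = node[0][:], []
    for k in range(1, n_tsv):
        nxt, arg = [], []
        for r, x in enumerate(cand):
            q = min(range(len(cand)),
                    key=lambda q: opt[q] + link[k - 1] * abs(x - cand[q]))
            nxt.append(node[k][r] + opt[q] + link[k - 1] * abs(x - cand[q]))
            arg.append(q)
        opt = nxt
        back.append(arg)
    r = min(range(len(cand)), key=opt.__getitem__)
    picks = [r]
    for arg in reversed(back):  # walk back from the top TSV to the bottom one
        r = arg[r]
        picks.append(r)
    return [cand[r] for r in reversed(picks)]


# Hypothetical instance: terminals are (x, layer) and freq[i][j] is c_{i,j}.
masters = [(0, 0), (10, 1)]
slaves = [(4, 2), (7, 0), (2, 1)]
freq = [[1, 2, 3], [4, 1, 2]]
best = place_tsvs_1d(masters, slaves, freq, 3)
print(best, objective(masters, slaves, freq, best))
```

The sketch separates the cost into per-TSV terms and terms between consecutive TSVs, so each layer only needs the best value for every position of the TSV below it. The y coordinates can be handled by the same routine.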
Multiple TSVs between Adjacent Layers
When multiple TSVs between adjacent layers are allowed (i.e., B > 1), we cannot determine TSV locations for x and y coordinates separately. Therefore, we represent the TSVs' locations by {(x 
PROBLEM TWO: APPROXIMATION ALGORITHMS FOR GENERATING A RECTILINEAR SHORTEST-PATH STEINER GRAPH
In this section, we build a network to connect masters, slaves and TSVs in each device layer. Since signals are delivered on one path with distributed multiplexers/demultiplexers, we formulate our problem as the RSSG synthesis problem first considered in [16]. We assume that the locations of a set of masters s_1, ..., s_n and slaves t_1, ..., t_m in a fixed layer are given. Notice that each TSV connected with this layer is considered as both a master and a slave. A solution of RSSG is a rectilinear routing containing a rectilinear shortest path for every master-slave pair, with total wire length as small as possible. RSSG is a generalization of the minimum rectilinear Steiner arborescence (RSA) problem [8] [13] and a relaxation of the minimum Manhattan network (MMN) problem [3], which have been proven to be NP-complete in [14] and [7], respectively. The RSA problem corresponds to the case where there is only one master s. We introduce two approximation algorithms for RSSG in the following subsections. One is a greedy heuristic that extends the insight of a 2-approximation algorithm given in [13] for solving the RSA problem. It requires only O(nm) extra memory beyond the inputs and easily handles large-scale RSSG instances. The other algorithm uses linear programming (LP) relaxation and rounding to solve small cases with higher accuracy. In practice, we observe that the solution of LP relaxation and rounding is within a factor of 1.0005 of optimal. However, the LP formulation of RSSG has O(nm(n + m)^2) variables and constraints, which makes it suitable only for solving instances with fewer than 1000 master-slave pairs on a computer with 4GB RAM.
Greedy Heuristic
The greedy heuristic starts with the m slaves as m subtrees and iteratively merges the pair of subtree roots p* and q* whose merging point r is farthest from the masters. We can think of this heuristic as obtaining the largest possible benefit, i.e., the sum of Manhattan distances from the masters to the merging point of p* and q*, in each iteration. Algorithm 2 provides the details of the greedy heuristic, with explanations as follows.
• Lines 1-3: The subgraph G and the set of vertices P of Hanan grid Hg are initialized.
• Lines 4-5: For each slave p ∈ {t_1, ..., t_m}, we define a demand set D_p initialized to {1, ..., n}, which means that an s_i-p rectilinear shortest path needs to be constructed in graph G for every i ∈ D_p. We also set the demand set to be empty for the other, non-slave points in P.
• Lines 6-14: The function GetBenefit(s_i, p, q, d) returns the benefit we can obtain in terms of s_i by merging p and q using direction d. Without loss of generality, we assume that p is above and to the left of q, as illustrated in Figure 5. The two possible directions of merging p and q are shown in Figures 5(a) and 5(c). The benefit obtained by GetBenefit(s_i, p, q, d) is the length of the path from s_i to a point v_i that can be shared; when no sharing is possible, GetBenefit(s_i, p, q, d) returns zero. Notice that after p and q are merged, the rectilinear shortest paths from s_i to p and from s_i to q can share the path from s_i to v_i; this is the benefit we achieve compared with connecting s_i with p and q separately.
• Lines 15-20: We update the demand sets after merging p* and q* using direction d*. For each master s_i such that i belongs to the intersection of D_{p*} and D_{q*}, the slave or Steiner point v_i is found by the function GetSteinerPoint, and we henceforth only need to connect s_i with v_i by a rectilinear shortest path, instead of connecting it with p* and q*. The corresponding update of demand sets is described in Lines 18-19 and illustrated in Figures 5(b) and 5(d).
• Lines 21-23: We get the merging point r and update the graph G according to the merging operation. The function ShortestPath gives a rectilinear shortest path between two points that uses the smallest total length of extra edges not already in G. It is implemented by a simple dynamic programming algorithm.
• Lines 25-29: When there are no intersecting demand sets, we fulfill the remaining connection requirements.
• Line 30: We delete all redundant edges in G, i.e., edges whose removal does not change the length of any s-t shortest path.
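The dynamic program mentioned for ShortestPath can be sketched as follows. This is an illustrative Python version under assumptions, not the routine of Algorithm 2: G is stored as a set of undirected grid edges, both endpoints lie on the Hanan grid given by `xs` and `ys`, and the names `edge` and `shortest_path_min_new_wire` are hypothetical.

```python
def edge(a, b):
    """Undirected grid edge between two adjacent Hanan grid points."""
    return frozenset((a, b))


def shortest_path_min_new_wire(p, q, xs, ys, existing):
    """Return (added_length, path) for a monotone grid path from p to q."""
    def span(a, b, coords):
        lo, hi = min(a, b), max(a, b)
        line = [c for c in sorted(coords) if lo <= c <= hi]
        return line if a <= b else line[::-1]

    # Grid coordinates inside the bounding box, ordered from p toward q.
    gx, gy = span(p[0], q[0], xs), span(p[1], q[1], ys)

    def step(a, b):
        # Wire that must be added to use grid edge a-b: nothing if it is already in G.
        return 0 if edge(a, b) in existing else abs(a[0] - b[0]) + abs(a[1] - b[1])

    cost, prev = {(0, 0): 0}, {}
    for i in range(len(gx)):
        for j in range(len(gy)):
            here = (gx[i], gy[j])
            for pi, pj in ((i - 1, j), (i, j - 1)):
                if pi < 0 or pj < 0:
                    continue
                c = cost[(pi, pj)] + step((gx[pi], gy[pj]), here)
                if (i, j) not in cost or c < cost[(i, j)]:
                    cost[(i, j)], prev[(i, j)] = c, (pi, pj)
    node, path = (len(gx) - 1, len(gy) - 1), []
    while node != (0, 0):
        path.append((gx[node[0]], gy[node[1]]))
        node = prev[node]
    path.append(p)
    return cost[(len(gx) - 1, len(gy) - 1)], path[::-1]


# Hypothetical example: three unit edges of G already run along the left and top sides.
G = {edge((0, 0), (0, 1)), edge((0, 1), (0, 2)), edge((0, 2), (1, 2))}
added, route = shortest_path_min_new_wire((0, 0), (2, 2), [0, 1, 2], [0, 1, 2], G)
print(added, route)
```

Because every step moves toward q inside the bounding box, any path the routine returns has Manhattan length, and the table only trades off how much of it reuses edges that are already in G.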
LP Relaxation and Rounding
In this section, we first give an integer linear programming (ILP) formulation of the RSSG problem and relax it into an instance of general linear programming (LP). Then, we devise a rounding technique to obtain an approximate solution of RSSG from the optimal solution of the LP relaxation. By comparing with the objective value of the LP relaxation, our results show that this approach achieves nearly optimal solutions.
The ILP formulation for the RSSG problem is given in (6) and the details are described as follows. Suppose there are Q pairs of masters and slaves, where Q = n × m. We use (s_l, t_l) to denote the l-th master-slave pair. To transform the RSSG problem into an ILP, we construct a directed graph N_l as in Figure 6. We use the first three constraints in (6) to guarantee the existence of a valid flow for each (s_l, t_l) pair. Now let E_H be the set of undirected edges in the Hanan grid and let the binary variable x_{uv} denote whether the edge (u, v) ∈ E_H is selected in the RSSG solution. If a rectilinear shortest path from s_l to t_l includes the edge (u, v) for any 1 ≤ l ≤ Q, then x_{uv} = 1. We use the fourth constraint in (6) to express this requirement. Finally, if d_{uv} denotes the length of edge (u, v), our objective is to minimize the total wire length, which is described by the objective function of (6). Since there are O((n + m)^2) edges in the Hanan grid and O(nm) master-slave pairs, it is easy to see that the ILP formulation contains O(nm(n + m)^2) variables and O(nm(n + m)^2) constraints.
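A formulation consistent with this description can be written out as below. It is a hedged reconstruction from the prose, not necessarily the exact form of (6): we assume N_l = (V_l, A_l) is the part of the Hanan grid inside the bounding box of s_l and t_l with every edge directed from s_l toward t_l, so that every directed s_l-t_l path is a rectilinear shortest path, and f^l_{uv} is a hypothetical name for the flow of pair l on arc (u, v).

```latex
\begin{align*}
\min \quad & \sum_{(u,v) \in E_H} d_{uv}\, x_{uv} \\
\text{s.t.} \quad
& \sum_{v:\,(s_l,v) \in A_l} f^{l}_{s_l v} = 1, && 1 \le l \le Q, \\
& \sum_{u:\,(u,t_l) \in A_l} f^{l}_{u t_l} = 1, && 1 \le l \le Q, \\
& \sum_{u:\,(u,w) \in A_l} f^{l}_{uw} = \sum_{v:\,(w,v) \in A_l} f^{l}_{wv},
  && 1 \le l \le Q,\ w \in V_l \setminus \{s_l, t_l\}, \\
& x_{uv} \ge f^{l}_{uv}, && 1 \le l \le Q,\ (u,v) \in A_l, \\
& f^{l}_{uv} \in \{0,1\},\ x_{uv} \in \{0,1\}.
\end{align*}
```

The first three constraints send one unit of flow from s_l to t_l, and the fourth forces an edge into the solution whenever some pair routes flow over it. With O(nm) pairs and O((n + m)^2) arcs per pair, the variable and constraint counts match the bounds stated above.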
Since solving ILP (6) is time-consuming and even intractable when n and m are large, we propose an LP relaxation and rounding technique described in Algorithm 3. We begin by relaxing the binary constraints on the variables f^l_{uv} and x_{uv} so that they can take any nonnegative real values. We thus obtain an LP instance that can be solved efficiently. The variables x_{uv} in the optimal solution of the LP relaxation lie in the range [0, 1] because of the other constraints, but may be fractional, in which case they do not yield a feasible RSSG solution.
To construct an RSSG from the LP solution, we adopt the usual approach of viewing each x_{uv} as the probability of selecting edge (u, v). Hence, we start with a graph G containing all edges in the Hanan grid and try to delete edges as long as the remaining graph still contains a rectilinear shortest path for each master-slave pair. The order of deletion depends on the optimal x_{uv} values: an edge with smaller x_{uv} is deleted from graph G earlier, if possible. Finally, the remaining graph G is a feasible RSSG solution. The optimal objective value of the LP relaxation is a lower bound on the optimal objective value of the ILP (6), and hence on the wire length of an optimal RSSG solution. We can evaluate the quality of the rounding solution by comparing it with this lower bound; experimental results below show that the largest ratio between them in practice is 1.0005.
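The deletion-based rounding can be sketched in Python as follows. This is an illustrative version under assumptions rather than Algorithm 3 itself: the LP values are supplied in a dictionary `x_value` (made up here, not produced by a solver), and the names `hanan_edges`, `has_shortest_path` and `round_lp` are hypothetical.

```python
def hanan_edges(points):
    """All edges between adjacent points of the Hanan grid induced by `points`."""
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    edges = []
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            if i + 1 < len(xs):
                edges.append(((x, y), (xs[i + 1], y)))
            if j + 1 < len(ys):
                edges.append(((x, y), (x, ys[j + 1])))
    return edges


def has_shortest_path(edges, s, t):
    """True if `edges` contains an s-t path whose length is the Manhattan distance."""
    sx = 1 if t[0] >= s[0] else -1
    sy = 1 if t[1] >= s[1] else -1
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    stack, seen = [s], {s}
    while stack:
        u = stack.pop()
        if u == t:
            return True
        for v in adj.get(u, []):
            # Only follow edges that move toward t and stay inside the bounding box.
            forward = (v[0] - u[0]) * sx >= 0 and (v[1] - u[1]) * sy >= 0
            inside = (min(s[0], t[0]) <= v[0] <= max(s[0], t[0])
                      and min(s[1], t[1]) <= v[1] <= max(s[1], t[1]))
            if forward and inside and v not in seen:
                seen.add(v)
                stack.append(v)
    return False


def round_lp(edges, x_value, pairs):
    """Delete edges in increasing order of LP value while every pair stays feasible."""
    kept = set(edges)
    for e in sorted(edges, key=lambda e: x_value.get(e, 0.0)):
        kept.discard(e)
        if not all(has_shortest_path(kept, s, t) for s, t in pairs):
            kept.add(e)  # this edge is still needed by some master-slave pair
    return kept


def wire_length(edges):
    return sum(abs(a[0] - b[0]) + abs(a[1] - b[1]) for a, b in edges)


# Hypothetical instance: one master, two slaves, and made-up LP values.
master, slaves = (0, 0), [(2, 1), (1, 2)]
pairs = [(master, t) for t in slaves]
grid = hanan_edges([master] + slaves)
x_value = {((0, 0), (1, 0)): 1.0, ((1, 0), (1, 1)): 1.0,
           ((1, 1), (2, 1)): 1.0, ((1, 1), (1, 2)): 1.0}
solution = round_lp(grid, x_value, pairs)
print(wire_length(solution))
```

Each deletion is tested against all pairs, so the graph that remains always contains a rectilinear shortest path for every master-slave pair, and no remaining edge can be dropped without breaking one.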
EXPERIMENTAL RESULTS

TSV Location
We first show the TSV location results with different communication frequencies and different upper bounds B on the number of available TSVs between adjacent layers. Let L be the number of device layers. We randomly distribute N masters and slaves in each layer. Figure 7 shows two optimal assignments of TSVs with different communication frequencies and the same distribution of masters and slaves. TSV locations are plotted as red lines. Blue dots and green dots denote masters and slaves, respectively. For clarity, we simplify each layer as a straight line. Here we assume B = 1, and the dynamic programming approach is adopted to obtain optimal solutions. We notice that in Figure 7(b), the TSV locations connecting the first three layers are to the left of those in Figure 7(a). At the same time, masters in the bottom two layers are farther away from slaves in the top layer in Figure 7(b). These observations reflect the fact that master-slave pairs in the first two layers communicate more frequently.
Table 1 compares the power consumption results for various bounds B. We assume that the communication frequencies for all master-slave pairs are the same in this experiment. The power consumption is defined as the total length of shortest paths normalized by the number of master-slave pairs. Notice that when B = ∞, every master-slave pair can be connected by a path with the shortest rectilinear distance, which gives us a lower bound on power consumption. In each cell of the table, the first value represents the power consumption and the second value is the percentage increase in power consumption relative to the case B = ∞ in the same row. We can see that the lower bound can be almost achieved by using only three TSVs between adjacent layers. Figure 8 gives an example of the TSV assignment results along with their power consumption for bounds B = 1, 2, 3 on the same setup of masters and slaves.
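The power metric and the B = ∞ lower bound described above are easy to compute directly. The Python sketch below assumes equal communication frequencies and neglects TSV length, as in this experiment; the function names and the sample coordinates are hypothetical.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def unbounded_tsv_power(masters, slaves):
    """Average planar Manhattan distance over all master-slave pairs (the B = infinity case)."""
    total = sum(manhattan(s, t) for s in masters for t in slaves)
    return total / (len(masters) * len(slaves))


def increase_percent(power, lower_bound):
    """Percentage increase of `power` relative to the B = infinity lower bound."""
    return 100.0 * (power - lower_bound) / lower_bound


# Hypothetical planar positions; layers do not matter when TSVs are unbounded.
masters = [(0, 0), (4, 0)]
slaves = [(1, 3), (3, 1)]
bound = unbounded_tsv_power(masters, slaves)
print(bound, increase_percent(5.0, bound))
```

A table cell for a finite bound B would then pair the measured value with `increase_percent` of that value against the bound for the same row.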
RSSG Construction
Table 2 gives the total wire length improvement of our algorithms when compared with the RSSG result in [16]. The column "LP(obj)" denotes the optimal objective value of the LP relaxation in (6). The column "LP(round)" denotes the total wire length of the LP relaxation and rounding result. Notice that in most cases, the rounding result and the optimal LP objective are the same, which means that we already obtain the optimal RSSG (since the LP relaxation gives a lower bound on the smallest total wire length). In our experiments, the largest gap between "LP(round)" and "LP(obj)", i.e., LP(round)/LP(obj), is 1.0005, which indicates that our rounding method can achieve nearly optimal solutions. The column "Ratio" gives the improvement of our RSSG generated by LP relaxation and rounding relative to the RSSG generated by the algorithm in [16]. We can see that our algorithm can reduce the total wire length by up to 21.78%.
Figure 5.2 plots examples of our RSSG solutions generated by the greedy heuristic and by LP relaxation and rounding. There are three masters and sixteen slaves, which are represented by blue dots and green dots, respectively. The LP relaxation and rounding method for this example achieves an optimal RSSG with total wire length of 5683, 5.38% better than the greedy solution, whose total wire length is 6006.
Table 3 shows the running time and memory usage of our LP relaxation and rounding algorithm. We use the Gurobi Optimizer [1] running on a machine with an Intel Core i3 2.4GHz CPU and 4GB memory to solve the LP relaxation of (6). Given this memory limit, we can solve RSSG instances with up to 1000 master-slave pairs by LP relaxation and rounding.
CONCLUSION
In this paper, we construct a framework and algorithms for synthesis of gated buses in 3D ICs. The power consumption of on-chip communication is the first objective, and we minimize it by optimizing TSV locations between device layers. In each device layer, we build a rectilinear shortest-path Steiner graph to connect masters and slaves, with the objective of minimizing the total wire length.
Experimental results show that our dynamic programming (resp. local search) algorithm efficiently determines the optimal (resp. nearly optimal) TSV locations, and that the approximation algorithms for generating RSSG reduce the total wire length by up to 22% compared with [16] .
