Abstract-In this paper, we present a high-quality analytical 3-D placement framework. We propose using a Huber-based local smoothing technique to work with a Helmholtz-based global smoothing technique to handle the nonoverlapping constraints. The experimental results show that this analytical approach is effective for achieving tradeoffs between the wirelength and the through-silicon-via (TSV) number. Compared to the state-ofthe-art 3-D placer ntuplace3d, our placer achieves more than 20% wirelength reduction, on average, with a similar number of TSVs. Furthermore, we extend this analytical 3-D placement framework with thermal awareness. While 2-D thermal-aware placement simply follows uniform power distribution to minimize temperature, we show that the same criterion does not work for 3-D ICs. Instead, we are able to prove that when the TSV area in each bin is proportional to the lumped power consumption of that bin and the bins in all tiers directly above it, the peak temperature is minimized. Based on this criterion, we implement thermal awareness in our analytical 3-D placement framework. Compared with a TSV oblivious method, which only results in an 8% peak temperature reduction, our method reduces the peak temperature by 34%, on average, with slightly less wirelength overhead. These results suggest that considering the thermal effects of TSVs is necessary and effective during the placement stage.
One challenge to 3-D IC design comes from the occurrence of through-silicon-vias (TSVs). Tiers in a 3-D IC are connected using TSVs. However, TSVs are usually etched or drilled through the device layers of each tier by special techniques and are costly to fabricate. A large number of TSVs will increase the area overhead and the cost of the final 3-D chip. Also, under the current technologies, TSV pitches are very large compared to the sizes of regular metal wires-usually around 5-10 μm. In 3-D IC structures, TSVs are usually placed at the whitespace between the macros or cells, so the number of TSVs will not only affect the routing resource but also affect the overall chip or package areas. Therefore, the number of TSVs in a circuit is constrained and needs to be controlled during physical design.
Another critical challenge to 3-D IC design is heat dissipation, which has already posed serious problems-even for 2-D IC designs [5] . The thermal problem is exacerbated in 3-D ICs for two main reasons: 1) the vertically stacked multiple tiers of active devices cause a rapid increase in power density, and 2) for face-to-back tier bonding, a dielectric layer exists between each tier to provide insulation. The thermal conductivity of the dielectric layers is very low compared to silicon and metal. For instance, at room temperature the thermal conductivity of the dielectric layer is 0.05 W/mK, while the thermal conductivity of silicon and copper is 150 and 285 W/mK, respectively [37] . Accordingly, the heat can mainly flow along TSVs instead of through the entire substrate. Such a decrease in the cross-sectional area of the heat channel further increases the chip temperature. Therefore, it is necessary to consider the thermal integrity during every stage of 3-D IC design, including the placement stage.
All of the existing 3-D placement approaches, which will be reviewed in Section II, are able to explore the tradeoffs among the wirelength and the number of TSVs. Two recent academic 3-D placers are the force-directed 3-D placer [29] and ntuplace3d [25] . The former placer is able to model the TSV area, but it cannot optimize the tier assignment during 3-D placement. The latter placer extends the bellshaped function to measure the 3-D area density, and was the best 3-D placer at the time of publication. However, the optimality of these placers is as yet unknown, leaving room for further improvement.
Moreover, it is well known that for 2-D ICs, properly distributed power dissipations can result in low temperatures. Most of the thermal-aware 3-D placers work simply extends this conclusion to 3-D and still focuses on power distribution 0278-0070/$31.00 c 2013 IEEE manipulation. However, as detailed in Section IV-A, this is no longer a good heuristic for temperature reduction in 3-D ICs. Since TSVs are the major channel for heat flow, their distribution dominates the impact on the temperature. A survey on concurrent TSV planning within thermal-aware 3-D floorplanning and 3-D routing is given in [19] . Unfortunately, none of the existing work in 3-D placement takes the thermal effect of TSVs into consideration, mainly due to the high complexity of such a practice.
In this paper, we first design a high-quality solver for the 3-D placement problem with the objective of wirelength and TSV number, so that it can be used as a basic engine to consider other constraints and objectives in 3-D placement. In our nonlinear optimization-based 3-D placement approach, our major contribution is the employment of both local and global smoothing techniques for the 3-D area density functions. Experimental results clearly demonstrate that these techniques achieve even better results than the state-of-the-art 3-D placers.
We further extend our placement engine to consider both the thermal effect and the area impact of TSVs. We first devise a simple criterion to guide the placement of TSVs for achieving the lowest temperature. By assuming that the dielectric layer is an ideal heat insulator, we are able to prove that when the TSV area of each bin is proportional to the lumped power consumption of that bin and the bins in all the tiers directly above it, the peak temperature is minimized. We then use this result to guide our 3-D placement engine. The smoothing techniques are also applicable here when we model the thermal awareness feature using density-like cost functions. Experimental results show that compared to the method that prefers a uniform power distribution, which only results in an 8% peak temperature reduction, our method reduces the peak temperature by 34% on average with even slightly less wirelength overhead. To the best of the authors' knowledge, this is the first thermal-aware 3-D placement tool that directly takes into consideration the thermal and area impact of TSVs.
A preliminary version of the thermal-aware feature was presented in [17] . In this paper, we include an enhanced 3-D placement framework, which supports both local and global smoothing techniques compatible with the thermal-aware feature. The remainder of this paper is organized as follows. Section II discusses related work in both wirelength-driven and thermal-aware 3-D placement approaches. Section III describes our basic 3-D placement framework and algorithm details. Section IV discusses the application of our 3-D placer to relieve thermal issues. Section V presents the experimental setups and results. Section VI concludes this paper.
II. Related Work

A. 3-D Placement Approaches
Most of the existing approaches, especially at the global placement stage, can be viewed as extensions of 2-D placement approaches. We group the 3-D global placement approaches into the following categories.
Partition-based approaches [4] , [18] , [23] : This kind of approach applies a sequence of bipartitions to perform the global placement in a divide-and-conquer paradigm, with intertier z cuts to minimize the number of TSVs, or intra-tier x/y cuts to minimize the wirelength. The cost of partitioning is measured by the cutsize of the nets across partitions. The order of the cutting directions determines the total TSV number. The earlier that z cuts are performed, the fewer TSVs are needed; the later that z cuts are performed, the more TSVs are needed.
Force-directed approaches [21] , [24] , [28] : Since the unconstrained quadratic wirelength minimization will result in a great amount of overlap, repulsive forces are introduced for overlap removal. The repulsive forces are computed iteratively, which eventually reduces the overlaps to an acceptable amount. There are two methods for computing the repulsive forces: 1) the forces point to the negative gradient of the area density field [21] , [28] or 2) the forces point to the desired placement estimated by cell shifting [24] .
Analytical approaches [25] , [38] , a.k.a. generalized forcedirected approaches: The analytical solver minimizes a sequence of penalized objectives, one of which is usually a wirelength/TSV objective plus a weighted overlap penalty. The weight of the overlap penalty increases from a small number until the overlaps are reduced to an acceptable amount. For example, [38] computes an overlap penalty from the unevenness of area distribution by computing discrete cosine transform (DCT) frequencies, and locally approximates this penalty function using a quadratic function. Minimizers of such overlap penalties are legal placements. The work in [25] extends the bell-shaped function to measure the area density for 3-D cubes to formulate of the 3-D area density constraints.
Partition-first approaches [1] , [29] : Unlike the approaches mentioned above, this kind of approach divides the 3-D global placement into two steps: 1) a vertical partitioning step to determine the tier assignment, and 2) an intratier placement step to determine the locations of placeable objects inside every tier. The vertical partitioning step is performed either by mincut partitioning [1] , or controlled-size cut partitioning [29] . The intratier placement can be implemented by straightforwardly extending any 2-D placement approaches.
Transformation-based approaches [15] , [20] : This kind of approach is capable of reusing existing 2-D placement solutions and constructing 3-D placement by transformation.
There are few publications specific to the 3-D legalization and the 3-D detailed placement problems. Usually, the legalization and detailed placement can be completed by running a 2-D legalizer and a 2-D detailed placer tier-by-tier with appropriate constraints.
B. Thermal Awareness
There are several works that address the thermal issue during 3-D placement. The force-directed method [21] applies thermal repulsive forces to move cells away from hotspots. The transformation-based 3-D placement [15] relieves the thermal issues at the legalization stage, where it is preferable to place hot cells close to the heat sink. The partitioning-based 3-D placement [23] uses net weights to shorten the high switching nets to reduce power, and uses pseudo nets to pull hot cells to the heat sink to reduce temperature. The work in [38] models and minimizes the unevenness of thermal distribution, in addition to minimizing the wirelength and the unevenness of cell area distribution. A detailed survey of 3-D physical design can be found in [10] and [11] .
III. Analytical 3-D Placement Framework
A. Overall Flow and Problem Formulation
This section presents our overall 3-D placement flow and formulates the analytical 3-D placement problem. The 3-D placement flow is illustrated in Fig. 1 . The flow consists of a floorplanning stage and a placement stage.
At the 3-D placement stage, two 3-D placers are supported-the pseudo 3-D (P3D) placer and the full 3-D (F3D) placer. The P3D placer is designed for the scenario where the tier assignment is fixed, and the solver places multiple tiers simultaneously. The F3D placer is designed for the scenario where the tier assignment is variable, and the placer has the capability to optimize the tier assignment as well as the intratier placement. The legalization and detailed placement are completed tier-by-tier using XDP [13] .
The 3-D floorplanning stage is optional when we apply the F3D placer. The floorplanner adapts the CBA algorithm in [14] , and plans a coarsened netlist for the scalability consideration. The floorplanning solution at this stage is either used by the P3D placer for a given tier assignment, or used by the F3D placer as an initial placement solution.
The following subsections formulate the analytical 3-D placement problem to be solved by P3D and F3D. Given a netlist H = (V, E) the number of tiers K, and the per-
, where V is the set of nodes including standard cells, intellectual property (IP) blocks, and other high-level hard macros, and E is the set of nets, a placement (x i , y i , z i ) of node v i ∈ V indicates that its center is at (x i , y i ) ∈ R and its tier assignment is z i ∈ {1, 2, . . . , K}. The 3-D placement problem is to find an optimal placement (x i , y i , z i ) for every v i ∈ V , so that an objective function of the weighted total wirelength is minimized, subject to the nonoverlapping constraints.
1) Wirelength Objective Function: The quality of a placement solution can be measured by its performance, power, and routability, but the measurement is nontrivial. In order to model these aspects during the optimization stage, the weighted total wirelength is a widely accepted metric of placement qualities [32] . Formally, letx,ȳ andz, respectively, be the vectors of (x i ), (y i ), and (z i ) the objective function is defined as
The objective function depends on the placement (x,ȳ,z) and it is a weighted sum of the wirelength WL(e) and the TSV number TSV (e) over all the nets. The weight (1 + γ e ) reflects the criticality of the net e, which is usually related to the performance optimization. The unweighted wirelength is represented by setting γ e to zero. This weight is able to model thermal effects by relating the weight to the thermal resistance, electronic capacitance, and switching activity [23] .
The wirelength WL(e) is usually estimated by the halfperimeter wirelength (HPWL)
Similarly, TSV (e) is modeled by the range of {z i : v i ∈ e} [15] , [21] , [23] TSV (e) = max
The coefficient α TSV is the weight for the TSVs; it models a TSV as a length of wire. For example, the work in [19] estimates that under the 0.18 μm silicon-on-insulator (SOI) technology, a TSV with a length of 3 μm is roughly equivalent to 8-20 μm of metal-2 wire in terms of capacitance, and it is equivalent to about 0.2 μm of metal-2 wire in terms of resistance. Thus, a coefficient α TSV between 8 and 20 μm can be used for optimizing power or delay in this case. In addition, we need a much larger coefficient α TSV when the available area of TSVs is limited; in such cases, this coefficient serves as both a penalty factor (to reduce the number of TSVs) and a modeling coefficient (to model the power or delay).
2) Nonoverlapping Constraints (Pseudo 3-D Placer): The ultimate goal of nonoverlapping constraints can be expressed as follows:
where w i and h i are the width and height of node v i , respectively. The same applies to node v j . Such constraints were used directly in some analytical placers early on [8] .
However, this formulation leads to a huge number of eitheror constraints, which grows quadratically with the number of nodes. This amount of constraints is not practical for modern large-scale circuits.
To formulate and handle these pair-wise nonoverlapping constraints, modern placers use a more scalable procedure to divide the placement into global placement and detailed placement. Detailed placement assigns every node to a legal site to satisfy the constraints in (4), and allows global placement to relax the pair-wise nonoverlapping constraints by regional area density constraints
For a 3-D IC with K tiers, each tier is uniformly divided into M × N bins for the measurement of overlaps, where the width and height of each bin is w bin = W tier /M and h bin = H tier /N, respectively. If every bin m,n,k satisfies (5), the global placement satisfies the nonoverlapping constraints. Examples of the regional area density constraints on one tier are given in Fig. 2 .
The overlapped area between bin m,n,k and node v i is defined as
where δ(z i , k) indicates whether node v i is on the same tier as bin m,n,k , and the functions Overlap x and Overlap y are the overlaps between the projections of node v i (a rectangle) and bin m,n,k (another rectangle) on the x-axis and the y-axis, respectively. The center of bin m,n,k is located at
Formally, these functions are defined as follows:
where Therefore, there are only M × N × K regional area density constraints; this number is usually much smaller than the number of node pairs.
3
) Nonoverlapping Constraints (Full 3-D Placer):
The nonoverlapping constraints defined in (5) in the previous subsection only work with a discrete {z i }. To make use of the analytical solver in the F3D placer, these discrete variables are relaxed and mapped to a continuous space.
. This definition is compliant to the discrete case, when z i = k indicates the node v i is placed on tier k.
In order to measure the 3-D area density, the virtual 3-D placement region is further divided into M ×N ×L bins. Each bin has a width of w bin = W tier /M, a height of h bin = H tier /N, and a depth of d bin = K/L. Accordingly, node v i consumes a virtual area of
The area constraints in the virtual 3-D placement region are defined as
The number of bins L in the z direction is not necessarily the same as the number of tiers K. As discussed in [12] , L = K is not enough to capture the nonoverlapping constraints, and L = 2K is sufficient to reflect the nonoverlapping constraints by the 3-D area density constraints. Thus, we assume L = 2K in the remainder of this paper.
The overlapped virtual area between bin m,n,l and node v i is defined as
The overlapping functions in the x and y directions are the same as (8) and (9), respectively. The overlap function in the z direction is defined as
This Overlap z can be viewed as a generalized version of The analytical 3-D placement problems (P3D and F3D) are then expressed as follows, respectively:
and
The nonoverlapping constraints in (5) and (10) are converted to equality constraints by inserting filler nodes [7] . We shall extend these formulations to add in thermal constraints in Section IV.
The P3D placer solves the problem in (13) with L = K, d i = 1, and a constantz, while the F3D placer solves the problem in (14) with a variablez. In the remainder of this section, we will discuss the detailed implementation of these analytical solvers.
B. Nonlinear Programming Solver
The equality-constrained optimization in (13) and (14) can be iteratively solved by a sequence of unconstrained optimization using the quadratic penalty method [34] . For simplicity, the function Penalty(x,ȳ,z) either refers
. It is obvious that the constraints are satisfied if and only if Penalty(x,ȳ,z) = 0.
The outer iterations of the quadratic penalty method work as minimize {OBJ + μ · Penalty} increase μ and repeat until Penalty ≈ 0.
In each outer iteration, given the penalty factor μ, we need to solve an unconstrained optimization problem. The solution of this problem is equivalent to the steady solution to the following ordinary differential equation (ODE) [7] d(x(t),ȳ(t),z(t))/dt = −∇F μ (x(t),ȳ(t),z(t)) (x(0),ȳ(0),z(0)) is a given initial placement (16) where
This ODE can be solved by the explicit Euler method, which gives the following iterative scheme:
As stated in [7] , the stepsize τ has to be small enough to guarantee convergence. The analytical upper bound for τ depends on the Hessian of F μ (x, y, z) which is difficult to determine. In practice, the value of τ is determined in an adaptive way: an initial stepsize τ is tried and then the convergence is checked; if it does not converge, the stepsize is scaled down by a ratio between 0 and 1 (e.g., 0.6), and the trial and error process is repeated.
Combining the outer iterations in (15) and the iterations in (17), we are able to solve the constrained optimization in (13) and (14) . The multilevel scheme is applied as in [7] and [12] .
1) Local Smoothing of the Overlapping Functions:
The overlapping functions defined in (8) , (9) , and (12) are nondifferentiable; thus, smoothing is required for the application of analytical methods. In this section we present the Huber smoothing as an alternative to the patented bell-shaped smoothing [33] . Moreover, the experimental results will show that the placement quality with Huber-based local smoothing plus Helmholtz-based global smoothing is better than the bellshaped smoothing approach [25] .
The gradient computation method, presented in [16] for 2-D placement, considers a continuous area density function by assuming the resolution M × N to be infinity. However, the area density map on the kth tier is implemented as a 2-D array
There is a gap between the formulation and the implementation. In order to bridge this gap, we introduce a local smoothing based on the Huber function, which serves an alternative to the bell-shaped local smoothing, and also presents another angle to understanding the gradient computation method in a discrete formulation.
We observe that the common length function in (8), (9), and (12) can be explicitly expressed as
These overlapping functions are nondifferentiable because of the absolute values. There are multiple ways to approximate the absolute value function by a differentiable function. To avoid the computation of the log function or the square root function, we choose the Huber function [6] as an approximation, which is defined as
To compare the overlapping function, the bell-shaped approximation, and the Huber-based approximation, we visualize these functions in Fig. 3 as the overlaps between a node and a bin in one dimension. The bin is located at the origin point with a unit bin width, and the nodes are with different widths (w node ) from 0.1×w bin to 10 ×w bin , illustrated from Figs. 3(a) to 3(e).
The Huber-based approximation is generated by setting the parameter μ = w bin /2. The bell-shaped approximation as shown in (20) , is generated by setting w = w bin and μ = w node , and is scaled so that the area covered by the curve is equal to the node area (20) The x-axis is the placement of the node, and the y-axis presents the overlapping length. These figures show that the Huber-based approximation is more accurate than the bellshaped approximation, especially when the node width w node is much greater (e.g., 10× greater) than the bin width w bin .
2) Density Penalty Function and Global Smoothing: If we replace the absolute value function in (18) with the Huber function, the overlapping functions become differentiable. Thus, the quadratic penalty function with Huber-based local smoothing for the area density constraints in formulation (13) and (14) is written as
We would like to apply the Helmholtz global smoothing as defined in [7] and [16] . We apply 2-D Helmholtz smoothing in the P3D placer to smooth D l (x,ȳ,z) tier-by-tier, and we apply 3-D smoothing in the F3D placer to smooth D(x,ȳ,z).
The Helmholtz smoothing can be implemented by solving a linear system [7] . We skip the details of implementation, and use the symbol A ,2−D as the 3-D Helmholtz smoothing operator, and A ∈,2−D as the 2-D Helmholtz smoothing operator, both of which are constant matrices determined by the structure of the related linear system and the smoothing parameter . Thus, the globally smoothed area densities and area capacities are expressed as
Therefore, the two versions of the quadratic penalty functions with global smoothing are computed as follows:
The gradients of these area density penalty functions can be simply computed by
These gradient expressions in the discrete area density formulation are consistent with the gradient computation method discussed in [16] are the twice-smoothing operators, which only need to be computed once and are reused in the computation of all the elements of the gradient.
We notice that the area density penalty function Penalty global,P3D (x,ȳ,z) is similar to the penalty function defined in [12] , based on the fact that the Huber-based smoothing is similar to the bell-shaped smoothing when the node size is twice as large as the bin size (L = 2K). Therefore, Penalty global,P3D (x,ȳ,z) is considered a reformulation of the penalty function in [12] .
IV. Thermal Awareness for Analytical 3-D Placement
In this section, we enhance the analytical 3-D placer with thermal awareness. Specifically, we take advantage of the thermal conductivity of TSVs for temperature reduction. We derive an optimal condition of the TSV-power distribution; this optimal condition enables the analytical 3-D placer to reduce temperature efficiently and effectively.
A. Motivation
The stack-die structure has dramatically increased power density compared to conventional 2-D ICs, and thus threatens the thermal reliability of 3-D ICs. In addition, the low thermal conductivity of the dielectric layers in face-to-back bonding tiers prohibits the heat from flowing vertically. Accordingly, as pointed out in [22] 
Such an observation results in the fundamental difference between the thermal-aware placement for 2-D ICs and for 3-D ICs. In 2-D placement, by properly distributing the power dissipations across the chip, heat can flow uniformly through the entire substrate to the heat sink, and the temperature can be minimized [35] . However, in 3-D ICs, it is the correlation between the distributions of the TSVs and the power density that has a direct impact on the temperature. For example, we assume a 4-tier 3-D chip with 6 W power in a 1.5 mm 2 area, and about 1200 TSVs per tier, where the 3-D technology parameters for temperature evaluation are the same as the experimental setups in Section V-B. We compare the two artificial placement results with the relative power values shown in Fig. 4(a) and (b) . In Fig. 4(a) , the power distribution is uniform while the TSVs are clustered in the center; while in Fig. 4(b) , the power distribution is nonuniform with 2 to 8 times higher power density in some regions than the previous case, and the TSVs are clustered proportional to the regional power density. The corresponding temperature maps are shown in Fig. 3(c) and (d) , respectively, where we can see that the nonuniform power distribution actually results in a lower temperature. From this artificial example, we can see that the locations of the TSVs play a very important role in the thermal integrity of 3-D ICs.
As expected, it is suboptimal for existing thermal-aware 3-D placement to be targeted at distributing power dissipations and neglect the thermal effect of TSVs. To improve this, a naïve approach would be to compute the optimal locations of the TSVs that can result in the minimum temperature during each iteration of placement. However, it will result in an optimization-in-the-loop with significant runtime overhead. Since thermal-aware placement mainly targets large designs, this method is less practical. On the other hand, if we adjust the locations of the TSVs after placement is done to minimize the temperature, it will bring about significant wirelength overhead because these TSVs are also part of the signal nets. We will address this dilemma in the remainder of the paper.
There are many different 3-D integration technologies, and different techniques can have totally different thermal models. In this paper we focus on the face-to-back bonding. In addition, although it is possible to insert additional thermal TSVs [22] after placement to further suppress the temperature, it brings in extra area overhead. Here we focus on exploring the opportunities of temperature reduction by utilizing the signal TSVs in 3-D placement. Our experimental results show that signal TSVs alone can already reduce the temperature significantly, with minimal wirelength or runtime overhead. Additional TSV insertion for thermal optimization as in [22] becomes optional.
B. Properties of a Thermally Optimal TSV Distribution
As discussed in Section IV-A, the fundamental problem in thermal-aware placement can be stated as follows. Given a power distribution, what is the optimal distribution of TSVs so that the temperature is minimized? While this problem seems to be complicated, we will show that the answer is surprisingly 
TABLE I
Major Notations
B (B 0 )
Thermal conductance matrix (without TSVs).
T (P)
Vectorized temperature (power) map.
The temperature in bin i for two-tier case (in bin i, tier k for multitier case).
The power in bin i for two-tier case (in bin i, tier k for multitier case).
A tot (A tot,k ) Total TSV area for two-tier case (in tier k for multitier case).
a i (a i,k ) TSV area in bin i for two-tier case (in bin i tier k for multitier case).
M i (M i,k )
Stamping matrix of the lumped TSV in bin i for two-tier case (in bin i, tier k for multitier case). n Number of bins in each tier. K Number of tiers. g TSV Thermal conductance of a unit area TSV.
simple. We can derive an analytical solution applicable to any optimization tools and thermal resistive network models. For simplicity of presentation, we summarize the key notations used in this section in Table I . To start, we assume steady-state analysis to calculate the temperature, where the chip is thermally modeled as a resistive network. We also lump the TSVs in each bin as a thermal conductor, with its conductance proportional to the total TSV area. The temperature-temperature relation can be expressed as
An example of the thermal resistive network is illustrated in Fig. 5 , where the nodes (labeled with numbers) are connected by thermal conductors (labeled with subscripted symbols), and the bin numbers are in a gray color. Take node 3 (bin 3, tier 1) for example; the power-temperature relation is expressed as g (1, 3) (t 3,1 − t 1,1 ) + g (3, 4) (t 3,1 − t 4,1 ) + g (3,7) (t 3,1 − t 3,2 ) = p 3,1 .
Thus, the network can be written in a matrix form as (28) , where each row corresponds to one node. If we treat TSV size as variables, the thermal conductance matrix B of the network can be expressed in a parameterized form as
where B 0 is the constant thermal conductance matrix without TSVs, and the variable a i,k is the total area of a lumped TSV in bin i, tier k. The stamping matrix M i,k indicates the connectivity of a lumped TSV from bin i, tier k to bin i, tier k+1. If we denote j 1 and j 2 to be the node ID corresponding to bin i, tier k and bin i, tier k+1 in the thermal resistive network, then based on the basic rules of stamping an element in a conductance matrix in SPICE [36] 
and all the other elements in M i,k are zeros. Note that we have taken the element value outside the matrix. Again, take node 3 for example. Let b (3, 7) be the thermal conductance between node 3 and node 7 when there is no TSVs, g TSV be the conductance of a unit-area TSV, and the variable a 1 3 be the area of a lumped TSV in bin 3; the conductance becomes
In this example, the stamping matrix M 3,1 only has nonzero elements M 3,1 (3, 3) = M 3,1 (7, 7) = +1, and M 3,1 (3, 7) = M 3,1 (7, 3) = −1. Now, we can mathematically state the problem for optimal TSV placement as
where A tot,k is the total area of the TSV connecting tier k and tier k+1, and is determined once the floorplanning is done. The infinity norm is defined as ||x|| ∞ = max{|x 1 |, |x 2 |, . . . , |x n |}. The objective function is obtained by simply substituting (30) into (28) . The two constraints are also self-evident: the total TSV area in each tier is a fixed number, and the lumped TSV area in each bin should be non-negative. Note that we have relaxed the constraint that the TSV area a i,k in each bin should be discrete. As such, the TSV areas mentioned in the theorems and corollaries proposed below should be rounded.
Problem (P1) is nonlinear in nature. Integrating nonlinear optimization engines in a placement tool directly would be impractical due to the high complexity.
Before we directly tackle (P1), we resort to a simpler version of the problem. For a two-tier 3-D IC, as shown in Fig. 6(a) , with a given power distribution, what will be the optimal locations of TSVs so that the temperature is minimized?
In this case, the bottom tier is directly attached to the heat sink, and we may assume that it has a uniform temperature to serve as the thermal ground. Accordingly, each TSV will be connecting between a node on the top tier and the thermal ground, and (P1) can be rewritten as
where M i is the stamping matrix for the TSV in bin i, a i is the total TSV area in bin i, and A tot is the total TSV area. At first look, this problem is still nonlinear and difficult to solve. But intuitively we should place more TSVs in the bins with higher power density to provide lower impedance to thermal ground. This leads to the conjecture that the optimal TSV area a * i in bin i should be proportional to the power dissipation p i . This conjecture is indeed correct, as stated in the following theorem:
Theorem 1 (Two-Tier Case): To minimize the peak temperature, the TSV area in bin i should be proportional to the power in that bin; i.e., the optimal solution of problem (P2) is
In the interest of space, we will only outline the proof for the theorem. From the fact that TSVs are the major vertical heat flow channel (g TSV a k b k,l where b k,l is the inter-tier conductance without TSVs), we can get
where a = [a 1 , a 2 , . . . , a n ] T . Based on Hölder's inequality, we have
Combining (35) and (36), we have
In order for ||T || ∞ to attain the above minimum, the inequalities in (37) must become equality. According to Hölder's inequality, such a condition is
Substitute it back to (35), and we can get
which, along with the second constraints in (P2), yields
ٗ Note that in the above theorem, we neglected the fact that the total TSV area in each area is discrete, that the dielectric layer is not an ideal thermal insulator, and that the total TSV area allocated in each bin cannot exceed the area of that bin. In reality, the optimal condition needs to be tailored to fit into these constraints. We can also easily derive a corollary based on this theorem.
Corollary 1: When the TSVs are placed proportional to the power consumption in each bin, the temperature in each bin is identical, that is
Corollary 1 has a particularly important meaning, as it allows us to generalize Theorem 1 (which is limited to the two-tier case) to the general multitier cases, as shown in the theorem below.
Theorem 2 (Multitier Case):
If we denote the bottom tier (attached to the heat sink) as tier K, and the top tier as tier 1, then to minimize the temperature, the TSV area in bin i of tier k(1 ≤ k ≤ K − 1) connecting to tier (k + 1) should be proportional to the lumped power in bin i of tier 1, 2 . . . , k. In other words, the optimal solution of problem (P1) shall satisfy
The proof can be derived based on the induction on the number of tiers with Theorem 1; this is because the optimized temperature in a tier is uniform and can be treated as thermal ground to further optimize upper tiers. Fig. 6(b) shows a simple three-tier (K = 3) example to illustrate the theorem, where each tier is divided into four bins (n = 4).
Similar to Corollary 1 for the two-tier case, we also have the following corollary for the multitier case.
Corollary 2: When the TSVs in each tier are placed proportional to the lumped power consumption in each bin and the same bins in all the tiers directly above, then each tier shall have a uniform temperature distribution. The temperature in tier j can be expressed as
To summarize this section, we would like to point out that all the theorems and corollaries are based on the assumption that TSVs are much more effective in conducting heat than the dielectric layer. And accordingly, we have treated the dielectric layer as an ideal heat insulator. In reality this may not be correct, and our theorem needs to be modified if the thermal conductivity of the dielectric layer is comparable with that of the TSV. Assume that the area of bin i is S i , the substrate thickness (including the dielectric layer) is L 1 , and the dielectric layer thickness is L 2 . Further, assume the thermal conductivity of the filling material of TSVs is κ 1 , and that of the dielectric layer is κ 2 . Then in the first-order approximation the thermally conducting dielectric layer is equivalent to a thermally insulated dielectric layer with some fictitious TSVs with equivalent thermal conductance, whose area a i in bin i satisfies κ 1 a i /L 1 = κ 2 S i /L 2 . Theorem 1 and Theorem 2 still hold, but when counting the TSV area in each bin, the fictitious TSV area
Consider two extreme cases: if κ 2 = 0, then the dielectric layer is thermally insulated, and the fictitious TSV area is 0. This is in accordance with our original theorems. On the other hand, if κ 2 = ∞, then the dielectric layer is completely conductive. The fictitious TSV area approaches infinity, and the TSV locations no longer matter.
C. Thermal-Aware 3-D Placement
In this section, we mainly focus on the P3D placer for the TSV/cell co-placement flow. The netlist for P3D placer is constructed after 3-D net splitting and TSV insertion as in [29] .
Based on the optimality condition in Theorem 2, we are able to effectively reduce the temperature during the 3-D placement step by an analytical method like the following:
where (x,ȳ) is the placement variable, D l (x,ȳ) = C l is the area density constraints as described in Section III-A.2, OBJ(x,ȳ) is the objective function as described in Section III-A1, TSV distribution cost DIST (x,ȳ) measures the distance between the current solution and a thermally optimal distribution, and β is a user-defined parameter for tradeoffs between wirelength quality and temperature reduction. The TSV distribution cost is also normalized by a factor of the ratio between the gradient norm of the initial OBJ function and the gradient norm of the initial DIST function. Please refer to Section III-B for the algorithms that solve problem (P3) by the quadratic penalty method when β = 0 and refer to [9] for the parameter tunings when β > 0. In this section, we focus on the definition of the TSV distribution cost function DIST (x,ȳ).
The TSV distribution cost is constructed with the property that DIST (x,ȳ) = 0 if and only if the optimality condition in Theorem 2 is satisfied. In detail, the cost is constructed as follows:
Let N i,k be the number of TSVs in the bin i, tier k, and we assign a negative power valuep TSV,k to all the TSVs on tier k. The negative power value is defined as
Under this assignment, the total negative power of the TSVs in the bin i, tier k isp
Therefore, the total TSV power and the lumped cell power in the bin i, tier k is
It is obvious that this amount of power value is equal to zero if and only if the TSVs are optimally distributed, as in Theorem 2. Thus, the TSV distribution cost can be defined as
which is a sum of squares of the total TSV power and the lumped cell power in each bin. This quadratic penalty method is an easy-to-use, common method in engineering practice to satisfy the equality constraints. Since the existence of a solution that satisfies both the area density constraint and the TSV distribution constraint is not easy to determine, we only penalize the DIST function by a finite number β instead of pushing it to +∞.
V. Experimental Results
In this section, we implement our algorithms in C++, and run our experiments on an Intel Xeon 2.0 GHz machine with 8 GB RAM under Linux. The common benchmarks in this section are seven opensource IP cores in the IWLS 2005 benchmarks [42] . The circuits are summarized in Table II , where the utility rate (Util.) is the total cell area divided by the total chip area, and the power values will be used in Section V-B.
We synthesize the circuits with a standard cell library for the MIT Lincoln Lab 130 nm 3-D SOI technology. The target 3-D technology is a 4-tier 3-D IC, with TSV size 6 μm×6 μm and TSV pitch 12 μm × 12 μm. The placement area is set as a square with 20% to 28% whitespace in total, and the I/O pins are placed uniformly along the boundaries in alphabetical order.
A. Results on Wirelength-Driven 3-D Placement
First of all, we test the F3D placer, as discussed in Section III, using the IBM-PLACE benchmarks [40] with a cell height of 64 μm. Since the analytical 3-D placer works with a multilevel scheme, we obtain 1-level (flat), 2-level, and 5-level placement results, respectively, as shown in Fig. 7 . The data points on each curve are obtained by setting 1 α TSV to 0.2, 2.0, and 20 times the cell height (64 μm) for the points on the left, middle, and right, respectively. The data show that the F3D placer with a moderate clustering level, labeled as 2-level F3D placement, provides the best placement quality on both HPWL and the TSV number. This is explained as follows.
If the weight of the TSVs is too small (e.g., 0.2×64 μm), the placer tends to ignore the TSV quality, and generates solutions with similar HPWL quality but with various TSV numbers. In such cases, clustering helps reduce the unnecessary TSVs without degrading the HPWL quality, because it reduces the inter-tier connections at the coarse-level placement. On the other hand, if the clustering levels are deep (e.g., five levels), the inter-tier connections are reduced too much; therefore the HPWL reduction cannot benefit much from the intertier connections. The experimental results recommend that we perform a moderate level of clustering in order to obtain the results with fewer TSVs and shorter HPWL.
Based on these results, we tune the parameters of our F3D placer to be a 2-level placer with α TSV be 500 times the cell height. The comparisons between our wirelength-driven F3D and P3D placers are shown in Table III . In the F3D placer, the TSVs are inserted after the 3-D global placement and before the detailed placement. The P3D placer implements In addition, we compare the F3D placer with a state-ofthe-art 3-D analytical placer ntuplace3d [25] . It applies the bell-shaped function, instead of the Huber-based smoothing, to measure the area distribution in the virtual 3-D placement region, but it does not use any global smoothing techniques. The comparisons 2 are listed in Table IV . The F3D placer column shows the placement results from Fig. 7 with 2-level clustering and α TSV = 20 × 64 μm. These data show that the F3D placer can achieve 21% shorter wirelength on average than the ntuplace3d approach with only 8% more TSVs. Although the F3D placer runs slower than ntuplace3d, the average empirical complexity of F3D placer is (N 1.07 ), which is faster than the ntuplace3d's complexity of (N 1.48 ) asymptotically. 2 We obtain the executable of ntuplace3d from the authors, and rerun the experiments on our machines. Please note that we assume zero TSV area in Table IV for both ntuplace3d and F3D. But assuming a TSV has 64 μm ×64 μm area, the TSV area in ntuplaced3d is 28% of cell area on average, and the TSV area in F3D is 29% of cell area on average. There is only 1% difference. described at the beginning of this section. The 3-D chip temperature is measured by the compact model in [37] , assuming that the height of the silicon layer is 300 and 25 μm on the bottom tier and the other tiers, respectively. The power dissipation of each cell is generated as follows. The circuit is partitioned into eight parts by hMetis [27] . Each part is assigned a random number between 0 and 1 as a relative power number. These relative numbers are scaled to power values such that the overall power density is 100 W/cm 2 , which is the power density for high-performance chips at the 14nm technology node projected by ITRS [41] .
B. Results on
2)
Comparison With Other Thermal Optimization Methods: The advantage of our thermal-aware 3-D placement, named P3D-Thermal, is compared with other thermal optimization methods in Table V , including the baseline, the TSV-oblivious method, and the postprocessing method.
The baseline is a wirelength-driven placement generated by solving problem (P3) with β = 0. The bin size for the area constraints D k (x,ȳ) = C k is set to be approximately the same as the average cell/TSV size to capture the overlap in a fine resolution. For the TSV-oblivious and P3D-thermal methods that will be described below, we set the bin size for the thermal cost DIST (x,ȳ) to be approximately 10× the average cell/TSV size, in order to capture this distribution cost in a proper resolution.
The TSV-oblivious method mimics the thermal-aware 3-D placement methods [21] , [38] that do not consider the thermal effects of TSVs. It is able to be implemented by solving problem (P3) with TSV powerp TSV = 0 Although a uniform-power distribution is not a thermal-optimal solution, the difference is only a few degrees according to the Hotspot [26] simulation for 3-D ICs if we ignore the thermal effects of TSVs. Thus, uniform power is a fair replacement for the previous thermal-aware 3-D placement methods without a proper TSV model. In this way, the TSV distribution cost becomes purely a power distribution cost. When the per-tier total power is assumed to be a constant, the minimizer of the cost function in (48) is a uniform per-tier power distribution. The cost weight β is set to 1 in the implementation.
The post-processing method is a direct application of Theorem 2 at the post-placement stage. After 3-D global placement, an optimal TSV distribution is computed according to the power distribution, regardless of overlaps. The assignment of TSVs to the TSV slots in the target distribution is computed by a linear assignment method to minimize the wirelength overhead. The resulting overlaps are removed by a legalization step. P3D-thermal represents our method, which optimizes the TSV distribution during 3-D placement. According to the results in Table V , it is clear that P3D-thermal outperforms the other two optimization methods, and reduces more temperature within a similar amount of wirelength overhead. The average rows in Table V show the average results normalized by the baseline results. P3D-thermal is able to reduce temperature by 34% on average, which is 4× greater than the TSV-oblivious method that reduces temperature by only 8%. Although the post-processing method makes use of the heat conductivity of TSVs, it is likely to cause congestion due to displacement. Thus, the legalized results have either higher temperature, or longer wirelength. Moreover, our P3D-thermal method provides a mechanism for wirelength and temperature tradeoffs, as shown in Fig. 8 . It is the visualization for the results of wb − conmax. The xaxis shows the normalized half-perimeter wirelength (HPWL), and the y-axis shows the temperature. The data points are generated with different β values labeled above the curve, where the left endpoint is generated with the TSV distribution cost weight β = 0.00 and the right endpoint is generated with β = 1.00.
The results for this case demonstrate that our method is able to reduce temperature with a negligible amount of wirelength degradation (e.g., 2%). Thus, it can be applied in the cases when the performance is critical and the acceptable wirelength degradation is limited.
3) Discussions on the Overhead and Extended Scenarios: In terms of wirelength overhead, we've already studied the HPWL impact due to TSV awareness. The results of the routed wirelength (RWL), reported by Cadence Encounter 10.1, are also listed in Table VI . We can see that the ratio between RWL and HPWL is more or less the same for each circuit, no matter if it is thermal-aware or not. These results demonstrate that our thermal-aware 3-D placement does not create extra routing congestion compared with the wirelength-driven placement.
Another effect related to wirelength overhead is the dynamic power. Please note that we made an assumption on the constant power values. In fact the dynamic power relates to the wirelength, especially the wires with high switching activities; the leakage power also depends on the temperature. The wirelength overhead would probably increase the dynamic power, and the temperature reduction would reduce the leakage power of hotspots. It is worthwhile extending our theory on the optimal TSV distribution (Theorem 1 and Theorem 2) to consider both effects.
In terms of temperature reduction, the data in Table V assume a power density of 100 W/cm 2 as stated earlier, so some temperatures are exaggeratedly high. In order to measure the temperature reduction under current technology node, we scale the power density to fit two more scenarios: 1) a highperformance processor Intel Xeon E5-2680 at 32 nm, and 2) an experimental 3-D processor 3-D-MAPS [30] at 130 nm. The former processor has a die area of 416 mm 2 and a TDP of 130 W, resulting in a power density of 31.25 W/cm 2 . The other processor has a total area of 25 mm 2 × 2 and a TDP of 4 W, resulting in a power density of 8 W/cm 2 . The temperature reductions for all these scenarios are listed in Table VII . The results show that it is worthwhile sacrificing up to 8% wirelength when there is more than 20°C temperature reduction, even for low-power designs when there are hotspots due to 3-D stacking. Please note that there are artificial hot clusters in the netlist in our experimental setup.
In terms of TSV styles, we assume that TSVs can be placed in a free style. Under current TSV technology, it is reasonable to consider TSV islands [31] , which relieve the stress to nearby transistors, and save area by reducing keep-out zones. To solve the TSV island-aware 3-D placement problem, it is possible to insert a TSV island formation step after each descent step in the analytical solver, which serves as a projection step in a gradient projection method [34] to satisfy the constraints of TSV arrangements. Due to the space limits, here we only roughly estimate the impact of TSV islands to the temperature reduction of our method. In Table VIII we repeat the P3D-Thermal method with different bin sizes for the thermal cost, which examine the cases when the TSV islands are too large to satisfy the TSV distribution constraints with a fine bin size. The 10× size represents the case when the bin size for the thermal cost is 10× the average cell/TSV size, and the results for coarse bin sizes are also included. Although the thermal optimization becomes inaccurate when the bins get coarse, a more inaccurate solution does not necessarily lead to a worse temperature. Thus, we see different trends when the bin size gets coarse, and the coarsest size (160×) still achieves 72% temperature reduction on average compared with the finest size (10×).
VI. Conclusion
In this paper, we presented our high-quality analytical 3-D placement framework. We proposed Huber-based local smoothing to work together with Helmholtz-based global smoothing for density constraints. The experimental results showed that this analytical approach achieved more than 20% wirelength reduction, on average, than the state-of-the-art ntuplace3d placer with a similar number of TSVs.
Furthermore, we identified a simple criterion for thermally optimal TSV distribution, where the TSV should follow the lumped power distribution. Based on this condition, we enhanced our analytical 3-D placement framework with thermal awareness. The experimental results showed that it effectively reduced the peak temperature with 6% wirelength degradation on average.
