Temperature tracking is becoming critically important in modern electronic design automation tools. In this paper, we present a deterministic thermal placement algorithm for standard-cell-based layout that leads to a smooth temperature distribution over the die. It is based mainly on the Fiduccia-Mattheyses partition scheme and an existing substrate thermal model that converts known temperature constraints into the corresponding power distribution constraints. In addition, a force-directed heuristic based on the cells' power consumption is introduced into this process. Experimental results demonstrate a comparatively uniform temperature distribution and a reduction of the maximum temperature on the die.
Introduction
Today, owing to increases in operating frequency, bandwidth, and system integration, many high-performance circuits consume a considerable amount of power. This consumed power, of which dynamic power accounts for about 85 percent according to the 2000 International Technology Roadmap for Semiconductors (ITRS) [1], is converted directly into a dramatic temperature increase. This temperature increase not only degrades circuit performance by slowing down the transistors on CMOS chips but also shortens chip lifetime through increased electromigration [2]. Therefore, in recent years, thermal issues have been considered in various phases of the electronic circuit design process. The thermal-aware placement problem studied in this paper is one such issue. Its objective is to minimize the maximum on-chip temperature and to obtain a comparatively even temperature distribution during the placement phase of VLSI physical design.
On the basis of Fourier's law and the law of conservation of energy, the following 3-dimensional (3-D) heat diffusion equation can be derived [3]:
$$\rho c_p \frac{\partial T(x, y, z, t)}{\partial t} = \nabla \cdot \left[ k(x, y, z, t) \nabla T(x, y, z, t) \right] + g(x, y, z, t) \tag{1}$$
where $T$ is the temperature (°C), $g$ is the power density of the heat source (W/m$^3$), $k$ is the thermal conductivity (W/(m·°C)), $\rho$ is the density of the material (kg/m$^3$), and $c_p$ is the specific heat (J/(kg·°C)). This equation expresses the temperature as a function of the time $t$ and the coordinates $(x, y, z)$ of the point.
Transient effects need not be considered in most analyses because the damping time of thermal conduction in a modern VLSI chip is several orders of magnitude larger than the operating cycle of the circuits. Thus we consider only the steady state of thermal conduction. Moreover, because the thermal conductivity k can be treated as a constant, Eq. (1) can be simplified further into the Poisson equation, Eq. (2), below, which can be regarded as the foundation of many thermal-aware placement algorithms.
$$\nabla^2 T(x, y, z) = -\frac{g(x, y, z)}{k} \tag{2}$$
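To illustrate the steady-state simplification with constant k, the following is a minimal sketch of a one-dimensional finite-difference solve of the Poisson form k·T'' = −g. The 1-D geometry, the zero-temperature boundaries, and the function name `solve_steady_state_1d` are assumptions made for illustration; they are not the 3-D mesh model used in this paper.

```python
def solve_steady_state_1d(g, k, length, n):
    """Solve k * T'' = -g on [0, length] with T = 0 at both ends.

    g: uniform power density, k: constant thermal conductivity,
    n: number of interior mesh nodes (hypothetical 1-D simplification).
    Returns the temperatures at the n interior nodes.
    """
    h = length / (n + 1)
    # Discretized equation at node i: -T[i-1] + 2*T[i] - T[i+1] = g*h^2/k
    rhs = [g * h * h / k] * n
    diag = [2.0] * n
    # Thomas algorithm, forward elimination (off-diagonals are all -1).
    for i in range(1, n):
        w = -1.0 / diag[i - 1]
        diag[i] -= w * -1.0
        rhs[i] -= w * rhs[i - 1]
    # Back substitution.
    temps = [0.0] * n
    temps[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        temps[i] = (rhs[i] + temps[i + 1]) / diag[i]
    return temps


if __name__ == "__main__":
    profile = solve_steady_state_1d(g=8.0, k=1.0, length=1.0, n=9)
    print(profile)
```

For a uniform source the analytic solution is T(x) = g·x·(L − x)/(2k), so the hottest point lies at the center, which is the kind of peak a thermal-aware placement tries to flatten.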
The majority of the previous studies on the thermal placement problem can be categorized into four classes on the basis of algorithms used:
• Simulated annealing approach [4];
• Approximation algorithms [5];
• Force-directed algorithms [6];
• Partition-based algorithms [7].
In [4], the authors propose a simulated annealing approach. Although this approach can handle most of the necessary properties, it suffers from long run times when the circuit is large. The work of [5] proposes a matrix synthesis approach, which belongs to the approximation algorithms. To find a matrix of cells in which the power dissipation of every t × t submatrix is minimal, the authors present three approximation methods to construct it. The main drawback of [5] is that the effect of interconnect capacitance on cell power dissipation is not considered. In [6], a force-driven algorithm is used. Combined with an iterative scheme, it achieves a significant reduction of the maximum temperature and total power consumption on the die, but it does not minimize the thermal gradient.
The partition-based algorithm is one of the most representative algorithms using a divide-and-conquer strategy, and it is used in [7] to deal with the thermal-aware placement problem. The experimental results of [7] verify the effectiveness of this partition scheme, but [7] gives no information on how to determine reasonable temporary locations for the cells at each partition level, which is indispensable for estimating the cells' current power dissipation. In fact, the power dissipation of a logic 
Here, $V_{DD}$ is the supply voltage, $f$ is the clock frequency, and $s$ is the switching rate. The total capacitance driven by the gate consists of two parts: $C_{load} = C_{pin} + C_{net}$, where $C_{pin}$ is the sum of the capacitances of all pins and $C_{net}$ is the interconnect capacitance of the net. $C_{net}$ depends on the locations of the pins because it can generally be computed as $C_{net} = l_{net} \cdot c$, where $l_{net}$ is the length of the approximated Steiner tree and $c$ is the capacitance per unit length.
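The dependence of a gate's power on the placement, through the net length, can be sketched as follows. The power expression itself is not reproduced in this text, so the sketch assumes the common dynamic-power form P = ½ · s · f · C_load · V_DD²; the factor ½, the function names, and the sample supply voltage are assumptions for illustration.

```python
def net_capacitance(net_length, cap_per_unit_length):
    """C_net = l_net * c, with l_net the (approximated) Steiner tree length."""
    return net_length * cap_per_unit_length


def gate_power(v_dd, freq, switching_rate, c_pin, net_length, cap_per_unit_length):
    """Dynamic power of one gate under the assumed form 0.5 * s * f * C_load * V_DD^2."""
    c_load = c_pin + net_capacitance(net_length, cap_per_unit_length)
    return 0.5 * switching_rate * freq * c_load * v_dd ** 2


if __name__ == "__main__":
    # 0.1 pF pin capacitance, 242 pF/m wire capacitance, 800 MHz clock;
    # the 2.0 V supply and 1 mm net length are hypothetical values.
    short_net = gate_power(2.0, 800e6, 0.5, 0.1e-12, 1e-3, 242e-12)
    long_net = gate_power(2.0, 800e6, 0.5, 0.1e-12, 2e-3, 242e-12)
    print(short_net, long_net)
```

Because C_net grows linearly with the net length, moving a cell farther from the pins it drives raises its power, which is why a temporary cell location is needed before power can be estimated.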
In our algorithm, we first transform the known temperature constraint into the corresponding power distribution constraint on the basis of a substrate thermal model. Next, we apply a Fiduccia-Mattheyses partition scheme, in which an extended force-directed heuristic determines the cells' locations with a view to optimizing power dissipation, to solve the thermal-aware placement problem. The objective is to minimize the maximum temperature on the die while distributing the cells' power dissipation evenly over the chip.
Some Related Work

An Existing Compact Thermal Model
Modeling the chip substrate as the 3-D lumped circuit network shown in Fig. 1, A.J. Chapman [8] used the finite-difference method (FDM) to solve the Poisson equation, Eq. (2), and construct an admittance matrix G that relates each node's temperature to the power dissipated at it. For details of this model, see [8].
Formula (4), in matrix form, is the thermal model we obtained from [4], where $T_i$ denotes the temperature of node i and $P_i$ is the power dissipation at node i. Assuming that the 3-D substrate mesh of Fig. 1 has m port nodes and n internal nodes, G is a symmetric positive definite matrix of size (m + n) × (m + n).
Moreover, if the first m rows and columns of G correspond to the port nodes on the surface of the substrate, a reduced "2-D" admittance matrix Y, which acts as a "bridge" between the port nodes' temperatures ($T_P$) and power dissipations ($P_P$), can be derived from the original "3-D" admittance matrix G. This reduction is given by formula (5).
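The reduction from G to Y can be sketched as follows. Formula (5) is not reproduced in this text, so the sketch assumes the standard Schur-complement (Kron) reduction Y = G_PP − G_PI · G_II⁻¹ · G_IP, which follows from G·T = P when the internal nodes carry no heat source. The function names and the small example matrix are hypothetical.

```python
def reduce_to_ports(G, m):
    """Eliminate internal nodes (indices m and above) from the symmetric matrix G.

    Returns the m x m reduced matrix Y. Each internal node k is removed with
    the update Y[i][j] = G[i][j] - G[i][k] * G[k][j] / G[k][k]; repeating this
    for all internal nodes yields the Schur complement of the internal block.
    """
    A = [row[:] for row in G]
    for k in range(len(A) - 1, m - 1, -1):
        pivot = A[k][k]
        for i in range(k):
            factor = A[i][k] / pivot
            if factor == 0.0:
                continue
            for j in range(k):
                A[i][j] -= factor * A[k][j]
    return [row[:m] for row in A[:m]]


def port_power(Y, port_temps):
    """Power constraint at the port nodes, P_P = Y * T_P."""
    return [sum(y * t for y, t in zip(row, port_temps)) for row in Y]


if __name__ == "__main__":
    # Hypothetical 3-node network: two port nodes and one internal node.
    G = [[3.0, -1.0, -1.0],
         [-1.0, 3.0, -1.0],
         [-1.0, -1.0, 3.0]]
    Y = reduce_to_ports(G, 2)
    print(Y)
    print(port_power(Y, [50.0, 50.0]))
```

With Y in hand, a target temperature vector on the surface maps directly to the power each surface node may dissipate, which is the conversion the placement algorithm relies on.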
Fiduccia-Mattheyses Partition
The Fiduccia-Mattheyses (FM) partition used here is a modified form of the Fiduccia-Mattheyses heuristic [9] for partitioning in VLSI physical design. The FM partitioning algorithm is as follows. Start with an initial random but "balanced" partition Ω = {A, B}. Evaluate the net cut (the number of nets that connect cells in A to cells in B and are therefore cut by the partition). For every cell in A ∪ B, find the reduction g in the net cut obtained by moving that cell to the other side; g is called the gain of the move. If g > 0, the move is beneficial. Choose a free cell $b_1$ in A with the highest gain $g_1$. If moving it from A to B preserves the "balance" conditions in Ω, make this move temporarily and "lock" cell $b_1$; otherwise, just "lock" it within A. Then update the gain values of the remaining free cells, and find the new maximum gain $g_2$, held by a cell $b_2$ in B, for the next move. Continue this process, alternating between A and B, until no free cells remain.
Find a step count k > 0 such that the total gain $g_{total} = \sum_{i=1}^{k} g_i$ is maximized, and move the corresponding cells $b_1, \ldots, b_k$ permanently.
Repeat the above process until the maximal $g_{total} \le 0$, that is, until there is no further reduction in the net cut size.
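The pass structure described above (tentative moves, locking, and committing the best prefix) can be sketched as follows. This is a simplified illustration, not the implementation used in this paper: gains are recomputed naively instead of with bucket lists, balance is measured by cell count with unit cell areas, and the best free cell is taken from either side rather than strictly alternating. All names are hypothetical.

```python
def cut_size(nets, side):
    """Number of nets that have cells on both sides of the partition."""
    return sum(1 for net in nets if len({side[c] for c in net}) > 1)


def gain(cell, nets, side):
    """Reduction in cut size if `cell` is moved to the other side."""
    before = cut_size(nets, side)
    side[cell] ^= 1
    after = cut_size(nets, side)
    side[cell] ^= 1
    return before - after


def fm_pass(nets, side, max_imbalance=2):
    """Run one FM pass on `side` (cell -> 0 for A, 1 for B), in place.

    Returns the total gain of the committed prefix of moves.
    """
    work = dict(side)
    free = set(work)
    moves, gains = [], []
    while free:
        count_a = sum(1 for s in work.values() if s == 0)
        count_b = len(work) - count_a
        best = None
        for cell in sorted(free):
            delta = -1 if work[cell] == 0 else 1
            if abs((count_a + delta) - (count_b - delta)) > max_imbalance:
                continue  # move would violate the balance condition
            g = gain(cell, nets, work)
            if best is None or g > best[0]:
                best = (g, cell)
        if best is None:
            break
        g, cell = best
        work[cell] ^= 1       # tentative move
        free.remove(cell)     # lock the cell
        moves.append(cell)
        gains.append(g)
    # Commit only the prefix of moves with the largest positive total gain.
    best_total, best_k, running = 0, 0, 0
    for k, g in enumerate(gains, 1):
        running += g
        if running > best_total:
            best_total, best_k = running, k
    for cell in moves[:best_k]:
        side[cell] ^= 1
    return best_total


if __name__ == "__main__":
    nets = [[0, 1], [2, 3]]
    side = {0: 0, 1: 1, 2: 0, 3: 1}
    while fm_pass(nets, side) > 0:
        pass
    print(side, cut_size(nets, side))
```

Accepting temporarily negative gains inside a pass, then rolling back to the best prefix, is what lets the heuristic escape some local minima that a purely greedy move rule would get stuck in.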
FM Partition Scheme and Force-Directed Heuristic
In our algorithm, before performing the partition scheme, we build a mesh model of the substrate as shown in Fig. 1, compute the admittance matrix Y of the surface, and use it to convert the known temperature constraint into the corresponding power distribution constraint.
As in [7], to handle the thermal placement problem under the power distribution constraint, we perform a series of top-down two-way FM partitionings. This scheme was used in the well-known EDA tool GORDIAN and led to good placement quality [10]: starting from the top-level block (the whole chip), partition every block into two subblocks, and repeat recursively until the number of cells in each block falls below a certain threshold. At each level of top-down partitioning, we partition every block according to the process described in Sect. 2.2. Partitioning can thus be thought of as a successive approximation method for placement. At each level, the cells are localized in the region of the chip in which they should finally be located, but their exact positions are not yet fixed. As the circuit is partitioned further and smaller groups of cells are assigned to smaller chip areas, we get a better approximation of their final coordinates. This scheme is less susceptible to local minima because the coordinates of all modules are approximated simultaneously, with mutual benefit. Moreover, according to [11], the results obtained from partitioning-based placement are second only to those of simulated annealing. In addition, these algorithms take much less CPU time.
Our thermal-aware placement process can be written as the following pseudocode:
THERMAL-AWARE PLACEMENT ALGORITHM
Compute the admittance matrix Y.
Estimate the average substrate temperature.
Convert the temperature constraint into its corresponding power distribution constraint using formula (5).
Generate an initial placement.
Our algorithm has two inputs: a power distribution constraint and an initial placement. The initial placement is generated randomly. Although our algorithm targets standard-cell-based layout, the task of its main part is only to obtain the rough relative locations of the cells. Once these relations are finally determined, a final legal placement can be obtained with a heuristic method. A randomly generated initial placement is therefore both reasonable and time-saving.
In our algorithm, the power distribution constraint enters each level of FM partitioning in two ways. First, in addition to the area balance condition of the traditional FM partitioning scheme, a power balance condition, whose objective is to distribute the cells' power consumption uniformly, is used to judge the legality of moving the current cell with the maximum gain g. Second, the power constraint plays an important role in determining a reasonable temporary coordinate for the cell. The details are given in Sects. 3.1 and 3.2, respectively.
Power Balance Condition in FM Partition
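A partition Ω = {A, B} is called power (or thermal) balanced if condition (6) is satisfied.
$$r = \frac{\sum_{\text{node } i \in A} P_i}{\sum_{\text{node } i \in A \cup B} P_i}, \quad 1 - r = \frac{\sum_{\text{node } i \in B} P_i}{\sum_{\text{node } i \in A \cup B} P_i}, \quad \frac{P(A)}{P(A) + P(B)} \approx r \tag{6}$$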
In (6), r is a parameter less than 1, $P_i$ is the power dissipation constraint on mesh node i of the substrate surface, which is obtained through the admittance matrix Y in the initial stage of our algorithm, and P(A) and P(B) are the total power dissipations of the cells in A and B, respectively. To preserve power balance, a cell b in A is finally moved to B in a pass if:
after moving b. Here $P_s$ is a power "slack" that can be adjusted experimentally. If $P_s$ is too large, partitioning runs faster, but the power balance condition is enforced so loosely that the power distribution on the die drifts far from the target. Conversely, if $P_s$ is too small, some beneficial moves become illegal, which also hurts the algorithm. In our experiments, we set $P_s$ to the largest power consumption of any cell in A and B.
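The role of the slack in the legality test can be sketched as follows. The exact move condition is not reproduced in this text, so the sketch assumes a symmetric tolerance |P(A) − r · (P(A) + P(B))| ≤ P_s evaluated after the move; it also ignores the change in the power of neighboring cells caused by the move. Both simplifications and all names are assumptions.

```python
def is_power_balanced(power_a, power_b, r, slack):
    """Assumed balance test: P(A) stays within `slack` of its target share r."""
    total = power_a + power_b
    return abs(power_a - r * total) <= slack


def move_keeps_power_balance(cell_power, power_a, power_b, r, slack):
    """Check the partition after moving one cell of power `cell_power` from A to B."""
    return is_power_balanced(power_a - cell_power, power_b + cell_power, r, slack)


if __name__ == "__main__":
    # Hypothetical block: P(A) = 6, P(B) = 4, target ratio r = 0.5, slack 1.
    print(move_keeps_power_balance(1.0, 6.0, 4.0, 0.5, 1.0))
    print(move_keeps_power_balance(3.0, 6.0, 4.0, 0.5, 1.0))
```

In an FM pass, a test of this kind would sit next to the area balance check, so a high-gain move is locked in place rather than executed when it would unbalance the power.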
An Extended Force-Directed Heuristic
There is a precondition for applying the power balance condition to a cell move: the coordinate to which the current cell would move must be determined temporarily and reasonably. This is indispensable for estimating the cell's current power consumption. We therefore introduce a force-directed heuristic driven by power minimization into our algorithm.
Suppose we are making a partition Ω = {A, B} of block R, and b ∈ A is the cell with the currently highest gain in Ω. To calculate the power dissipation of the relevant cells after b is moved to B (here, the relevant cells are b and the other cells driving b's input pins), we must determine the coordinate at which b is located in B.
We use formula (3) to calculate the power dissipation of the cells after moving b from A to B. Since b's coordinate in the new placement is exactly what we want to find, we cannot obtain the length of the corresponding Steiner tree for the interconnections. Instead, we use the sum of the nets' quadratic lengths to approximate the length of the Steiner tree. The total power dissipation in block R = A ∪ B can therefore be expressed as:
where b is a cell in R, $b_{outpin}$ is the set of output pins of cell b, $s_i$ is the switching rate of the net that pin i drives, $i_{driving}$ is the set of other cells' input pins driven by output pin i, and $C_{jpin}$ is the capacitance of pin j. Additionally, $(X_i, Y_i)$ is originally the coordinate of pin i, but for simplicity we replace it with the coordinate $(X_b, Y_b)$ of cell b, and likewise for the other cells.
It is well known that the right side of (8) is both quadratic and convex with respect to the coordinate variables $(X_i, Y_i)$, so its minimum is obtained by setting its first derivative with respect to the coordinates to zero. Treating b's coordinate $(X_b, Y_b)$ as the unknown and the coordinates of the other cells in R as known values, Eq. (9) is obtained by setting the first derivative of (8) to zero.
In (9), we suppose that cell b has only one output pin, so $s_b$ can be regarded as the switching rate of its output pin. In addition, k is the number of pins of the net that b drives, $b_{driving}$ is the set of other cells' input pins that b drives, and $C_{\mathrm{driving}\,b}$ is the set of cells driving the input pins of b.
Thus, using the interconnections between the moving cell b ∈ A and the other cells, we apply Eq. (9) to determine the coordinate at which cell b should be located. If this coordinate falls outside the area of B, we project b onto the nearest location in B.
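The two steps above, solving for the zero-gradient position and projecting it into B, can be sketched as follows. Eq. (9) is not reproduced in this text, so the sketch assumes a generic weighted quadratic objective Σ w_j · ((X − x_j)² + (Y − y_j)²), whose minimizer is the weighted centroid of the connected cells. The weights (for example, switching rates), the rectangular shape of B, and all names are assumptions.

```python
def equilibrium_position(neighbors):
    """Minimizer of sum(w * ((X - x)^2 + (Y - y)^2)) over connected cells.

    neighbors: list of (x, y, weight) tuples with positive weights.
    Setting the first derivative to zero gives the weighted centroid.
    """
    total_w = sum(w for _, _, w in neighbors)
    x = sum(w * xi for xi, _, w in neighbors) / total_w
    y = sum(w * yi for _, yi, w in neighbors) / total_w
    return x, y


def clamp_to_region(x, y, region):
    """Project (x, y) onto the nearest point of an axis-aligned rectangle.

    region: (x_min, y_min, x_max, y_max), standing in for the area of B.
    """
    x_min, y_min, x_max, y_max = region
    return min(max(x, x_min), x_max), min(max(y, y_min), y_max)


if __name__ == "__main__":
    connected = [(0.0, 0.0, 1.0), (4.0, 0.0, 1.0), (2.0, 6.0, 2.0)]
    x, y = equilibrium_position(connected)
    print(clamp_to_region(x, y, (3.0, 0.0, 10.0, 2.0)))
```

For an axis-aligned rectangle, clamping each coordinate independently gives the nearest point of the region, so no search is needed for the projection step.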
Formula (9) and the traditional force-directed formula used in placement problems whose objective is to minimize wire length have not only a similar form but also a similar meaning: cells that are connected by nets exert an attractive force on each other (Fig. 2). The magnitude of the force between any two cells is directly proportional to the distance between them, as in Hooke's law for the force exerted by stretched springs. If the cells in such a system were allowed to move freely, they would move in the direction of the force until the system reached equilibrium at a minimum energy state.
After calculating the current cell's coordinate using formula (9), we adjust it to avoid overlap among the cells. In effect, this embodies a repulsive force among the cells.
Other Related Issues in the Algorithm
A few other issues are worth specifying:
(1) In the extended force-directed heuristic, the length of the Steiner tree is replaced with the sum of the quadratic lengths of the net. This would not be a good approximation if we wanted to calculate the cells' real power dissipation, but when determining the location of a moving cell we need only relative location information among the cells, not the exact Steiner tree length. Moreover, like the Steiner tree length, quadratic length is a measure of net length. The approximation is therefore acceptable here.
(2) At the beginning of the algorithm, the user can provide either a flat or an uneven temperature distribution constraint on the die. Alternatively, a target flat temperature profile can be obtained as in the experiments of this paper. First, we generate a number of random placements and compute the average total power dissipation. Then, as in [4], we estimate the average substrate temperature from this average total power consumption using the chip operating temperature formula given in the 1997 National Technology Roadmap for Semiconductors.
(3) When converting the final power distribution into the final temperature distribution, we again use the formula $Y T_P = P_P$ in matrix form. Since Y, like G, is positive definite [7], we can use Cholesky factorization to obtain $T_P$.
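The final conversion from power back to temperature can be sketched as a Cholesky solve of Y·T_P = P_P. The sketch below is a plain textbook factorization written for illustration, with hypothetical names and a small example matrix; a production implementation would normally call a numerical library instead.

```python
import math


def cholesky(Y):
    """Return lower-triangular L with L * L^T = Y for a symmetric positive definite Y."""
    n = len(Y)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(Y[i][i] - s)
            else:
                L[i][j] = (Y[i][j] - s) / L[j][j]
    return L


def solve_temperatures(Y, P):
    """Solve Y * T = P via forward and backward substitution on the Cholesky factor."""
    L = cholesky(Y)
    n = len(P)
    z = [0.0] * n
    for i in range(n):  # forward substitution: L * z = P
        z[i] = (P[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    T = [0.0] * n
    for i in range(n - 1, -1, -1):  # backward substitution: L^T * T = z
        T[i] = (z[i] - sum(L[k][i] * T[k] for k in range(i + 1, n))) / L[i][i]
    return T


if __name__ == "__main__":
    Y = [[4.0, 2.0], [2.0, 3.0]]  # hypothetical reduced admittance matrix
    P = [10.0, 8.0]               # hypothetical port power vector
    print(solve_temperatures(Y, P))
```

Cholesky factorization applies here precisely because Y is symmetric positive definite, and the factor can be reused for every power vector evaluated on the same mesh.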
Experimental Results
We implemented our algorithm in C. The experiments were conducted on a PC with an Intel Celeron 2.20 GHz CPU and 512 MB of RAM running Windows XP.
We applied our algorithm to three standard-cell layout test cases (fract, struct, and biomed) from the MCNC benchmark suite [12]; their details are listed in Table 1.
As in [7], we ran our experiments with zero row spacing, which is the style of modern layout designs. In the experiments, the mesh has 11 × 11 × 7 lines in the x, y, and z directions. Thermal conductivity for the packaging is assumed to be 7 W/(m·°C) for the sides, 2000 W/(m·°C) for the top, 8800 W/(m·°C) for the bottom, and 150 W/(m·°C) for the silicon for all test cases. The ambient temperature is assumed to be 0 °C. The switching activity is randomly generated between 0 and 1 for each net. In addition to the capacitance values of the cells' input pins, which are provided by the test cases, the capacitance of each related I/O pin is assumed to be 0.1 pF, and the wire capacitance is assumed to be 242 pF/m. The clock frequency is assumed to be 800 MHz.
Table 1 shows the experimental results for the three test cases, comparing runs with and without thermal constraints. We ran each experiment 50 times with and 50 times without thermal constraints, and Table 1 lists the average values over the 50 runs. Here, $T_{max}$ is the maximum temperature across the chip, and $L_{total}$ is the total wire length of the final placement, measured as the Euclidean distance between pins. The average run time (Time) for each test case is also listed. From the results, we can see that our thermal-aware placement reduces the maximum temperature on the die at the cost of a wire length at most 2.8% larger than that of placement without thermal constraints.
Figures 3, 4, and 5 indicate that, for all three test cases, our thermal-aware placement yields a more uniform temperature profile on the die than placement without thermal constraints. In these figures, the X-Y plane represents the chip plane, and the unit of the X and Y axes is µm. The Z axis represents the temperature at each point on the chip, in °C. We emphasize that the runs shown in Figs. 3, 4, and 5 are chosen from our experimental results such that their $T_{max}$ values nearly equal the corresponding averages in Table 1.
A partition-based thermal-aware placement algorithm presented in [7] uses a multigrid-like heuristic at each level of partitioning to guide the temperature distribution, so a comparison between its experimental results and ours is worthwhile. To avoid effects arising from differences in how the data are handled in some phases, it is more reasonable to compare relative values of the experimental results. Table 2 compares the effectiveness of these two kinds of thermal constraints (the fract benchmark was not simulated in [7]). All data listed are the percentage increments between the respective runs with and without the corresponding thermal constraints. In practice, the run without thermal constraints is a pure traditional two-way partitioning scheme. From this comparison, we can see that our thermal constraint is also effective overall.
Table 2
Comparison with the partition-driven thermal-aware placement algorithm in [7] (based on the incremental values between the respective runs with and without the corresponding thermal constraints). Columns: Name, Increment of Thermal Algorithm in [7], Increment of Our Thermal Algorithm.
In addition, although [7] does not give the final total power consumption and placement area, for clarity we also list the incremental values of these quantities resulting from our algorithm in Table 2. On the one hand, the power consumption $P_{total}$ increases as $L_{total}$ increases, as analyzed before. On the other hand, because we adopt the zero-row-spacing style for standard-cell placement and the number of rows is fixed at the beginning of the algorithm, the change in area is very small.
Conclusions
In this paper, we have described a temperature-aware placement algorithm. It is based on an existing substrate thermal model and the Fiduccia-Mattheyses partitioning scheme; at each partitioning level, an extended force-directed heuristic determines the cells' temporary locations with a view to optimizing power dissipation.
The experimental results have demonstrated the algorithm's effectiveness in reducing the maximum temperature and obtaining a uniform temperature profile on the die.
One direction of our future work on thermal-aware placement is to design analytical approaches that can guarantee solution quality mathematically and incorporate various constraints more easily. In addition, since the current passing through power/ground nets is large, the optimization of these nets deserves particular attention in thermal-aware placement.
