Abstract: The density of chip power dissipation has been increasing steadily over the past several years. High operating temperatures and hotspots degrade chip performance and undermine chip reliability. Reducing maximum on-chip temperatures is becoming increasingly important as technology scales below 65 nm. Existing thermal floorplanners compact blocks at the lowest leftmost position allowed by the floor plan encoding. Such compaction minimises chip area but is sub-optimal for wire length and thermal objectives. It is possible to move the blocks within the whitespace (unoccupied chip area) to minimise the maximum on-chip temperature without affecting the overall chip area and with a minimal wire length increment of 2-3%. However, reallocation of whitespace for thermal optimisation has not been addressed by researchers to date. Here, a constrained particle swarm optimisation algorithm is developed to find an optimal solution to this problem. Simulation results on MCNC benchmark circuits indicate that this method can reduce the maximum on-chip temperature of thermal-aware floor plans by 0.58-7.10°C.
Introduction
Floorplanning is an important stage of the very large-scale integration circuit design cycle. Given a set of rectangular blocks, floorplanning determines a non-overlapping placement of the blocks to minimise chip area and wire length. In order to facilitate hierarchical design, the classical rectangle packing problem has been replaced by a fixed-outline soft block packing problem [1]. Minimising the floorplan whitespace, which is the chip area unoccupied by blocks, is therefore no longer an issue. In fact, a minimum amount of whitespace is essential for gate sizing, buffer insertion, congestion improvement and cross-talk noise reduction. As technology scales below 65 nm, it is likely that the minimum requirement for whitespace will be dictated by thermal objectives. However, there is no existing method for reallocation of whitespace for thermal optimisation. In this paper, we demonstrate that it is possible to lower the operating temperature by redistributing whitespace.
The floorplanning problem has been studied in depth over the past two decades. The introduction of placement constraints, net-length constraints and thermal objectives has increased the complexity of floor plan optimisation. Simulated annealing (SA) [1], genetic algorithms (GA) [2] and multiobjective evolutionary algorithms [3] are popular methods for determining the optimal floor plan. These methods model the block placements using structures such as the sequence pair (SP) [4], bounded slicing grid [5], O-tree [6] and B*-tree [7] representations.
In recent times, the microprocessor clock frequency has doubled every three years, and supply voltage scaling has been unable to stem the rise in power dissipation [8].
To keep designs within the power budget, multi-core processors are being launched and convex-optimisation-based processor speed control techniques are being developed to keep the hotspot temperature under control [9]. However, the benefits of whitespace reallocation remain to be exploited in both single-core and multi-core processors.
Reducing the peak temperature is important because high temperature has severe detrimental effects on the performance of the chip. First, an increase in chip temperature deteriorates the chip's reliability due to electromigration. Second, due to the temperature dependence of carrier mobility, the current driving capability of transistors decreases as temperature increases. Also, high temperature increases interconnect resistivity. Low current driving capability of transistors coupled with high interconnect resistivity may increase interconnect delay to such an extent that timing constraints are violated. Furthermore, high temperature increases the risk of thermal runaway, caused by the exponential dependence of leakage power on temperature. In order to avoid failures, thermal packages are designed to withstand peak power dissipation. As the peak temperature increases, so does the cost of cooling. Each watt of power dissipated above 35-40 W is estimated to add $1 to the overall cost of the chip [10]. The motivation behind whitespace reallocation is to improve performance and reliability and to reduce packaging cost by reducing the maximum on-chip temperature.
Previous research
A wealth of literature exists on thermal optimisation of 2D integrated circuits (ICs) that aids in reducing the packaging cost. Such cooling solutions are provided at the package level, the chip level or the architecture level. At the package level, improved cooling systems (heat sink, air circulation) can significantly cool the chip. A popular chip-level technique is dynamic thermal management (DTM) [11], which is a major design feature of the Intel Pentium 4 microprocessor [12]. The key observation behind DTM is that typical applications running on a processor very rarely cause the processor temperature to ramp up beyond a certain threshold value. In DTM-enabled processors, the thermal package is targeted for the threshold temperature, which is much less than the maximum attainable temperature. In the rare event that the on-chip sensors detect that the maximum temperature is approaching the threshold value, a controller dynamically throttles the frequency of the chip to reduce dynamic power dissipation and runs the processor at a lower frequency until the thermal virus section of the application has been executed. Thus, DTM reduces packaging cost, albeit by sacrificing performance. Architecture-level techniques [13-15] have the advantage of reducing temperature without slowing down the processor. The importance of microarchitecture in reducing the hotspot temperature has been underscored in [13]. Skadron et al. have developed a compact-substrate thermal model to accurately simulate the temperature on ICs. The model is available in the form of a tool named HotSpot. Thermal-aware floorplanning typically involves the use of HotSpot within a GA floorplanner [14] or an SA floorplanner, Hotfloorplan [15]. Fast thermal floorplanning techniques use thermal metrics computed from heat diffusion [16] or power-density gradients [3].
To date, all research on thermal floorplanning has focused on determining the relative positioning of the blocks that minimises the peak temperature. Thermal-aware floor planners usually compact blocks towards the bottom-left corner to minimise chip area. Such compaction is suboptimal for minimising thermal objectives. We propose a method by which the peak on-chip temperature can be minimised by determining the optimal positioning of the blocks (or processors) in the whitespace.
Whitespace reallocation is not uncommon in physical design. Whitespace planning is utilised for improving the stability of placement algorithms used in physical synthesis [17]. Post-floorplanning whitespace reallocation has been used for wire length minimisation [18]. Using the semiperimeter of the bounding box as an approximation for the net-length, it is possible to formulate the post-floorplanning wire length minimisation (by whitespace reallocation) problem using linear programming and solve it with the help of a min-cost flow-based algorithm [18]. Other whitespace reallocation methods for optimising various non-thermal objectives are provided in [19, 20]. However, reallocation of whitespace to optimise the maximum on-chip temperature has not been reported so far. In this paper, we present a heuristic based on swarm intelligence to locate an optimal thermal placement. We select the SP floor plan representation because of its ability to represent non-slicing floor plans. However, our method can also be easily extended to other floor plan representations.
SP representation
A floor plan with N blocks can be represented by a pair of sequences $(S_1, S_2)$, each having N integer elements. This pair imposes certain constraints on the relative positioning of the blocks on the chip. For every element i in the sequence, we can find sets B(i) and L(i) denoting the blocks that are before (left of) i and lower than i in the floor plan, respectively, where
Based on the constraints imposed by the set B(i), a horizontal constraint graph GH(V, E) can be constructed as follows:
The vertex set V consists of a source s, a sink t and N vertices labelled with module names. The edge set E consists of directed edges. Edges are drawn from the source to each of the vertices. Each vertex in turn is connected to the sink t. A directed edge exists from vertex j to i in GH(V, E) iff j ∈ B(i). Vertex weights are zero for the source and sink and equal to the module width for the rest of the vertices. The second part of the SP is represented by a vertical constraint graph GV(V, E), which can be constructed in a manner similar to the horizontal constraint graph. For this graph, a directed edge exists from vertex j to i in GV(V, E) iff j ∈ L(i).
The x-coordinate of the lower left corner of a particular block in the floor plan represented by $(S_1, S_2)$ is the length of the longest path from the source to the vertex representing the block. The length of the longest path from source to sink in GH(V, E) gives the minimum width of the packing imposed by the SP $(S_1, S_2)$. The height of the packing can be calculated in a similar manner from GV(V, E). Fig. 1 shows a floor plan along with its corresponding horizontal and vertical constraint graphs.
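The longest-path evaluation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the commonly used SP convention (j is left of i when j precedes i in both sequences; j is below i when j follows i in the first sequence and precedes it in the second), and the names `sp_relations`, `longest_path` and `pack` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Block = str


def sp_relations(
    s1: Sequence[Block], s2: Sequence[Block]
) -> Tuple[Dict[Block, List[Block]], Dict[Block, List[Block]]]:
    """Return (B, L): B[i] lists blocks left of i, L[i] lists blocks below i.

    Assumed convention: j is left of i if j precedes i in both sequences;
    j is below i if j follows i in s1 and precedes i in s2.
    """
    pos1 = {b: k for k, b in enumerate(s1)}
    pos2 = {b: k for k, b in enumerate(s2)}
    left_of: Dict[Block, List[Block]] = {b: [] for b in s1}
    below: Dict[Block, List[Block]] = {b: [] for b in s1}
    for i in s1:
        for j in s1:
            if j == i:
                continue
            if pos1[j] < pos1[i] and pos2[j] < pos2[i]:
                left_of[i].append(j)
            elif pos1[j] > pos1[i] and pos2[j] < pos2[i]:
                below[i].append(j)
    return left_of, below


def longest_path(
    order: Sequence[Block],
    preds: Dict[Block, List[Block]],
    size: Dict[Block, float],
) -> Dict[Block, float]:
    """Longest source-to-vertex path length, visiting blocks in topological order."""
    coord: Dict[Block, float] = {}
    for b in order:
        coord[b] = max((coord[j] + size[j] for j in preds[b]), default=0.0)
    return coord


def pack(s1, s2, width, height):
    """Compacted lower-left coordinates and the packing width and height."""
    left_of, below = sp_relations(s1, s2)
    # Under the assumed convention, s2 is a topological order of both graphs.
    x = longest_path(s2, left_of, width)
    y = longest_path(s2, below, height)
    chip_w = max(x[b] + width[b] for b in s1)
    chip_h = max(y[b] + height[b] for b in s1)
    return x, y, chip_w, chip_h
```

For example, with sequences (a, b, c) and (c, a, b), block c lies below both a and b, and a lies to the left of b.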
Problem formulation
Mathematically, the problem of thermal optimisation by whitespace reallocation can be stated as follows: Given a set of N hard modules (macro cells or processors), where $w_i$ is the width and $h_i$ is the height of module i, a set of power values $\Phi = \{\omega_i \mid \omega_i \text{ is the average power dissipated by block } i\}$ and an SP $(S_1, S_2)$ imposing certain topological constraints on the blocks, find a non-overlapping placement $P = \{(x_i, y_i) \mid x_i \text{ and } y_i \text{ are the coordinates of the lower left corner of module } i\}$ that minimises the maximum on-chip temperature T
$$\min\; T : P \rightarrow \Re \qquad (3)$$
subject to specific constraints on the width and height of the chip. The exact nature of these constraints depends on the mode of floorplanning and is explained later in this section.
On-chip temperature can be calculated analytically from the average power dissipated in the modules. The generalised equation for a transient temperature analysis in a 3D substrate is given by [21] 
subject to the boundary condition
where T is the temperature as a function of the position vector r and time t; k, ρ and $C_p$ are the thermal conductivity (W/m°C), density (kg/m$^3$) and specific heat (J/kg°C) of the material, respectively; Q is the heat generation rate (W/m$^2$); ∂/∂n represents differentiation along the outward normal drawn at the boundary surface and $f_i$ is any arbitrary function. At steady state
Substituting (6) in (4) and neglecting the temperature dependence of thermal conductivity, the equation for computing the steady-state temperature becomes
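For reference, the standard heat-conduction equations that match the symbols defined above can be written as follows. This is a sketch under assumptions: the numbering follows the in-text references to (4), (6) and (7), the forms are the textbook ones rather than a transcription of [21], and the boundary condition (5) is omitted because its exact form cannot be inferred from the surrounding definitions.

```latex
% Transient heat conduction (assumed textbook form)
\rho C_p \frac{\partial T(\mathbf{r},t)}{\partial t}
  = \nabla \cdot \bigl[ k \, \nabla T(\mathbf{r},t) \bigr] + Q(\mathbf{r},t)
  \tag{4}

% Steady-state condition
\frac{\partial T(\mathbf{r},t)}{\partial t} = 0
  \tag{6}

% Steady-state temperature with temperature-independent k
k \, \nabla^{2} T(\mathbf{r}) + Q(\mathbf{r}) = 0
  \tag{7}
```

Setting the time derivative in (4) to zero and taking k outside the divergence gives (7), which is a Poisson equation in T.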
In view of the complex boundary condition involved, (4) is solved most commonly using numerical techniques such as finite difference method or finite-element method. An analytical solution of (7) can be obtained by simplifying the boundary conditions. The temperature profile on a planar substrate with insulated bottom face containing a single power-dissipating module is obtained by solving (7) and is given by [22] u(a, r) = c 1 I 0 (mr
where r is the radial distance from the centre of the module; W, H, $P_m$, k and t are the width, height, power consumption, thermal conductivity and thickness of the module, respectively; $I_0$ and $I_1$ are the zeroth- and first-order modified Bessel functions of the first kind; $K_0$ and $K_1$ are the zeroth- and first-order modified Bessel functions of the second kind. Equation (8) suggests that the temperature falls off like a Gaussian surface with the radial distance. In the presence of more than one power-dissipating module, the overall temperature at any point is obtained by superposition of the temperature values produced by each individual power-dissipating module. The temperature $T_i$ at the centre of module i is given by
$$T_i = \sum_{j=1}^{N} T_{ij} \qquad (9)$$
where $T_{ij}$ is the contribution to the temperature at the centre of module i from power dissipation in module j, as calculated using (8). Evidently, each term $T_i$, being a sum of Gaussian-like functions, is itself a non-linear function of the block locations. Thus, the objective function (T) of the optimisation problem (3) is non-linear.
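The superposition in (9) can be illustrated with a toy model. The sketch below is not the Bessel-function solution (8); it swaps in a plain Gaussian kernel as a stand-in, with `sigma` and `gain` as hypothetical parameters, purely to show how the peak value depends non-linearly on block positions.

```python
import math
from typing import List, Sequence, Tuple


def centre_temperatures(
    centres: Sequence[Tuple[float, float]],
    powers: Sequence[float],
    sigma: float = 1.0,
    gain: float = 1.0,
) -> List[float]:
    """Temperature rise at each module centre by superposition.

    Each module j contributes gain * P_j * exp(-d^2 / (2 sigma^2)) at distance d.
    The Gaussian kernel is an assumed stand-in for equation (8).
    """
    temps = []
    for xi, yi in centres:
        total = 0.0
        for (xj, yj), p in zip(centres, powers):
            d2 = (xi - xj) ** 2 + (yi - yj) ** 2
            total += gain * p * math.exp(-d2 / (2.0 * sigma ** 2))
        temps.append(total)
    return temps
```

Moving two equally powered modules apart lowers the maximum of the returned values, which is the effect that whitespace reallocation exploits.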
In general, there are three constraints that have to be satisfied to generate a valid placement: First, modules should not overlap each other. Second, the modules must be packed within a rectangle of specified chip dimensions (in fixed-outline floor planning) or the boundaries imposed by compaction of modules towards the lower left corner (in conventional floor planning). Third, the relative order of the blocks as obtained from floor planning must be preserved. Mathematically, the constraints may be stated as
where $W_{chip}$ and $H_{chip}$ are the width and height of the chip, respectively. The non-overlapping constraints are imposed by (10) and (11); the chip-boundary constraints are imposed by (12) and (13). The relative ordering constraint is imposed by the SP $(S_1, S_2)$ of the floor plan. Together, these conditions make the solution space highly constrained.
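A legality check for these three constraints might look like the sketch below. The form of the inequalities is an assumption: each block must start at or beyond the right edge of every block in B(i) and the top edge of every block in L(i), and must fit inside the chip outline. The function name `is_legal` and the tolerance `eps` are hypothetical.

```python
from typing import Dict, List, Tuple


def is_legal(
    pos: Dict[str, Tuple[float, float]],
    size: Dict[str, Tuple[float, float]],
    left_of: Dict[str, List[str]],
    below: Dict[str, List[str]],
    chip_w: float,
    chip_h: float,
    eps: float = 1e-9,
) -> bool:
    """Check boundary and relative-order constraints for a placement.

    pos[i] = (x_i, y_i) is the lower-left corner; size[i] = (w_i, h_i).
    left_of[i] and below[i] play the roles of B(i) and L(i).
    """
    for i, (x, y) in pos.items():
        w, h = size[i]
        # Chip-boundary constraints.
        if x < -eps or y < -eps or x + w > chip_w + eps or y + h > chip_h + eps:
            return False
        # Horizontal ordering: every block left of i must end before i starts.
        for j in left_of[i]:
            if pos[j][0] + size[j][0] > x + eps:
                return False
        # Vertical ordering: every block below i must end before i starts.
        for j in below[i]:
            if pos[j][1] + size[j][1] > y + eps:
                return False
    return True
```

Because an SP relates every pair of blocks either horizontally or vertically, satisfying the ordering inequalities also rules out overlaps, so no separate pairwise overlap test is needed in this sketch.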
The non-linear nature of the thermal objective and the various constraints involved make reallocation of whitespace for thermal optimisation a constrained non-linear optimisation problem (CNOP) [23]. Owing to the complexity and unpredictability of CNOP, there is no deterministic solution to this problem. Because of the non-discrete nature of the solution space, optimisation techniques like GA and SA cannot be applied to solve this problem. We, therefore, develop a constrained particle swarm optimisation (CPSO) framework to solve this problem.
Particle swarm optimisation
Particle swarm optimisation (PSO), introduced by Eberhart and Kennedy [24], is a swarm intelligence technique inspired by the social behaviour of bird flocking. A population of collaborating agents (particles) flies in a multidimensional search space. These agents have a common fitness function that they want to minimise by locally searching the landscape and globally coordinating with each other. Each particle in the swarm retains the memory of the best position (locally lowest fitness) it has encountered in the past and is aware of the global best position (globally lowest fitness) of the swarm. Based on this information and the inertia of its motion, each particle keeps updating its velocity in search of the global minimum. The procedure continues for several iterations and the global best position is selected as the final solution. Our CPSO flow is shown in Fig. 2. In this problem, each particle in the swarm moves in a constrained (2N + 2)-dimensional space, where N denotes the number of blocks in the floor plan. The position $P_r$ and velocity $V_r$ of each particle r in the swarm are represented by vectors as follows
Each particle actually represents a floor plan and therefore (x 
Legal random initial placement
In order to start the CPSO procedure, the position of each particle in the swarm must be initialised. Positions obtained by random initialisation are mostly inconsistent with the constraints mentioned earlier. Thus, random initialisation is excessively time-consuming. We present an algorithm for generating feasible initial solutions. A key concept that we use here is that of a block slack [1] . The slack of a block in (13) can be computed using the pseudo-code SP_EVAL_REV presented in [1] . The slacks of the blocks can then be computed using
The method for legal random initialisation of position and velocity is presented in the pseudo-code LEGAL_RAND_INIT (Fig. 3). The algorithm starts by initialising the coordinates of the source vertex to zero. The variables $R_i$ and $T_i$ represent the minimum values of the x-coordinate and y-coordinate of block i, computed after placing all blocks that precede it in the constraint graphs. Thereafter, the effective slack is computed using the following equations
Clearly, any value of $x_i$ in the range $[R_i, R_i + XSlack_i^{eff}]$ will satisfy all the constraints. All components of the velocity vector are initialised to zero, and the local best position of each particle is initialised to the current position vector of the particle. The global best position is initialised to the position vector of the particle having the smallest fitness value. The procedure for fitness assignment will be discussed later.
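The idea behind legal random initialisation can be sketched for the x-direction as follows. This is an illustration under assumptions, not a transcription of LEGAL_RAND_INIT: the slack of a block is taken as the gap between its right-packed and left-packed positions, and the effective slack is taken as the right-packed position minus $R_i$. The function names are hypothetical, and `order` must be a topological order of the horizontal constraint graph.

```python
import random
from typing import Dict, List, Sequence, Tuple


def x_slacks(
    order: Sequence[str],
    left_of: Dict[str, List[str]],
    width: Dict[str, float],
    chip_w: float,
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Left-packed and right-packed x-coordinates; their difference is the slack."""
    x_left: Dict[str, float] = {}
    for b in order:
        x_left[b] = max((x_left[j] + width[j] for j in left_of[b]), default=0.0)
    # Invert the relation to get, for each block, the blocks to its right.
    right_of: Dict[str, List[str]] = {b: [] for b in order}
    for i in order:
        for j in left_of[i]:
            right_of[j].append(i)
    x_right: Dict[str, float] = {}
    for b in reversed(order):
        x_right[b] = min((x_right[k] for k in right_of[b]), default=chip_w) - width[b]
    return x_left, x_right


def legal_random_x(order, left_of, width, chip_w, rng):
    """Draw a random x-placement that respects ordering and the chip width."""
    _, x_right = x_slacks(order, left_of, width, chip_w)
    x: Dict[str, float] = {}
    for b in order:
        # R_b: leftmost legal position given the blocks already placed.
        r_b = max((x[j] + width[j] for j in left_of[b]), default=0.0)
        eff_slack = x_right[b] - r_b  # assumed form of the effective slack
        x[b] = r_b + rng.uniform(0.0, eff_slack)
    return x
```

The y-coordinates would be drawn in the same way from the vertical constraint graph. The sketch assumes the chip width is at least the compacted packing width, so that every effective slack is non-negative.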
Update position and velocity vector
The general rules for updating the position and velocity vector in a PSO problem are straightforward and are given by the following equations
where w is the inertial constant, and $c_1$ and $c_2$ are constants that dictate the proportion of the cognitive and social components in the velocity vector. At any instant of time, the velocity vector can be decomposed into three components. The first component is along the direction of the velocity vector in the previous iteration and is known as the inertial component. The second component is along the direction of the local best position w.r.t. the current position and is termed the cognitive component. The remaining component is along the direction of the best known global position w.r.t. the current position and is called the social component. The relative proportions of w, $c_1$ and $c_2$ affect particle behaviour by determining its tendency to trust a particular direction in comparison with the others. For example, a high $c_1$ to $c_2$ ratio indicates a less social particle, because it trusts its own knowledge over collective knowledge and has a greater tendency to search around its own local minimum. The mechanism of position and velocity updates is explained in Fig. 4. $R_1$ and $R_2$ are random vectors, with each component being a uniform random number between 0 and 1. Updating the position and velocity vectors according to (20) and (21) does not necessarily restrict the particle to the feasible region. It has to be ensured that the particle position is updated only if the newly generated position vector lies in the feasible region. Because of extreme limitations in the degrees of freedom of each particle in this problem, the particles generally freeze once they violate the legality criteria. Two steps are adopted to circumvent this problem. First, the particle velocity vector is set to zero whenever the particle hits the feasible boundary. This causes the particle to lose inertia, thereby preventing it from stagnating in subsequent iterations. Second, only those components of the velocity vector that have considerable slack associated with them are updated. See UPDATE_POS_AND_VEL (Fig. 5) below.
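A constrained update step along these lines can be sketched as follows. It uses the textbook PSO velocity and position rules, which are assumed to correspond to (20) and (21), and applies only the first of the two safeguards described above (zeroing the velocity when the move would leave the feasible region). The function name `update_particle` and the `is_feasible` callback are hypothetical; the default constants are the values reported in the parameter selection section.

```python
import random
from typing import Callable, List, Sequence, Tuple


def update_particle(
    pos: Sequence[float],
    vel: Sequence[float],
    pbest: Sequence[float],
    gbest: Sequence[float],
    is_feasible: Callable[[List[float]], bool],
    w: float = 0.98,
    c1: float = 0.99,
    c2: float = 0.97,
    rng=random,
) -> Tuple[List[float], List[float]]:
    """One PSO step that rejects moves leaving the feasible region."""
    # Inertial + cognitive + social components, with per-component random factors.
    new_vel = [
        w * v + c1 * rng.random() * (pb - x) + c2 * rng.random() * (gb - x)
        for x, v, pb, gb in zip(pos, vel, pbest, gbest)
    ]
    new_pos = [x + v for x, v in zip(pos, new_vel)]
    if is_feasible(new_pos):
        return new_pos, new_vel
    # Infeasible move: keep the old position and drop the inertia.
    return list(pos), [0.0] * len(vel)
```

In practice `is_feasible` would wrap a legality check on the block coordinates encoded in the position vector. Restricting updates to components with sufficient slack, the second safeguard, is omitted here for brevity.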
Create whitespace blocks
Simulating temperature at block-level granularity requires splitting up the whitespace into rectangular blocks. Whitespace blocks are treated as ordinary blocks with zero power dissipation. The method for splitting whitespace into rectangles is a two-step process: (i) determine whether whitespace exists at the lower-right (LR) and upper-left (UL) corners of each block and modify the appropriate flags associated with the block accordingly; (ii) depending on the status of the flags, create whitespace block m (as shown in Fig. 6) at the LR and/or UL corners until all whitespace has been enclosed by rectangles. CREATE_WHITESPACE_LR (Fig. 7) provides the pseudo-code for creating a whitespace rectangle at the LR corner of a block.
Parameter selection
In the PSO algorithm, there are several parameters that need to be tuned. The population size is usually between 10 and 30. In this case, a population size of 10 gives a sufficiently good result in a small runtime. The inertial constant w, cognition learning rate $c_1$ and social learning rate $c_2$ are set to 0.98, 0.99 and 0.97, respectively. The termination of the CPSO procedure can be defined by the user. We set the maximum number of iterations to 400 and terminate the process if the fitness has not improved in the last 50 generations.
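The termination rule can be expressed as a small driver loop. This is a sketch: `step` is a hypothetical callback that runs one swarm iteration and returns the current global best fitness, and `run_cpso` is an illustrative name.

```python
from typing import Callable, Tuple


def run_cpso(
    step: Callable[[], float],
    max_iters: int = 400,
    patience: int = 50,
) -> Tuple[float, int]:
    """Iterate until max_iters is reached or fitness stalls for `patience` iterations.

    Returns the best fitness seen and the number of iterations executed.
    """
    best = float("inf")
    stall = 0
    iters = 0
    for iters in range(1, max_iters + 1):
        fitness = step()
        if fitness < best:
            best = fitness
            stall = 0
        else:
            stall += 1
        if stall >= patience:
            break
    return best, iters
```

Early termination of this kind is consistent with a larger benchmark finishing sooner than a smaller one when its swarm saturates in fewer iterations.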
Assign fitness
To assign fitness values, we integrate HotSpot [13] with our CPSO. HotSpot uses the well-known duality between electric and thermal quantities to generate an equivalent RC model of the chip. For the steady-state temperature analysis, the equivalent circuit consists of a resistor circuit as shown in Fig. 8. Power-dissipating blocks are modelled as constant current sources and the lateral heat diffusion is modelled by resistors. Given the location of each module, whitespace blocks and their power consumption, HotSpot calculates the temperature at each node (centre of the block) by solving nodal equations for the equivalent circuit. After temperature evaluation, particle r is assigned a fitness value $f_r$ given by
Simulation results
At first, we used Hotfloorplan [15] to generate initial floor plans for MCNC benchmark circuits. Block power densities were randomly assigned using a uniform distribution with mean μ. Present-generation high-performance ICs have μ in the range of 1-2 W/mm$^2$, and μ is expected to reach 5 W/mm$^2$ for technologies below 50 nm [8]. We present our results for various values of μ from 0.5 to 4 W/mm$^2$. Table 1 shows the reduction in maximum on-chip temperature (ΔT) due to whitespace reallocation when μ = 1 W/mm$^2$. For large benchmarks like hp, ami33 and ami49, ΔT was found to be 3.46, 5.87 and 7.10°C, respectively. On the other hand, apte and xerox had minimal room for thermal optimisation because of very small slacks associated with the blocks after floor planning.
In the next set of experiments, we analysed in detail the effect of mean power density and mode of floor planning on ΔT, and also the wire length trade-off involved. Initial floor plans generated by Hotfloorplan [15] were fed to the PSO for whitespace reallocation. In Mode-I (fixed-outline floor planning), initial floor plans were made to satisfy a fixed-outline constraint and were additionally optimised for wire length (W) and maximum on-chip temperature (T). The fixed-outline constraint implicitly imposes area and aspect ratio constraints on the initial floor plan. Table 2 shows the results of applying CPSO to an initial floor plan of the ami33 benchmark circuit. The initial floor plan had 20% whitespace and unit aspect ratio. ΔT was 0.85°C for μ = 0.5 W/mm$^2$. As μ was increased, the benefits of whitespace reallocation became more pronounced. For μ = 4 W/mm$^2$, ΔT was 6.65°C. This indicates a 7.8× increment in ΔT for an 8× increment in μ, an almost linear increase. In Mode-II, whitespace reallocation of the same initial floor plan was done, but this time taking the boundaries of the left-compacted floor plan as the chip boundary. For μ = 4 W/mm$^2$, ΔT was 3.5°C, which is much less than the corresponding ΔT value in Mode-I. Observe that for μ = 4 W/mm$^2$, the internal whitespace (IW) of 16.21% contributes a 3.5°C reduction in T, and the external whitespace (EW) of (20 - 16.21)% = 3.79% contributes a (6.65 - 3.5)°C = 3.15°C reduction in T. Clearly, EW is more effective in reducing T. The intuitive reason behind this is that the availability of IW does not necessarily guarantee the availability of block slack. On the contrary, EW increases the available slack for all modules, thereby giving the CPSO more degrees of freedom for temperature optimisation. The average increase in HPWL due to post-floor-planning thermal optimisation was 3.03% for fixed-outline floor planning and 2.67% for conventional floor planning of ami33.
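The arithmetic behind the comparison of internal and external whitespace for ami33 can be checked directly from the reported figures. The per-percent values below are a derived illustration, not numbers reported in the tables, and `whitespace_effectiveness` is a hypothetical helper name.

```python
from typing import Dict


def whitespace_effectiveness() -> Dict[str, float]:
    """Recompute the ami33 figures quoted in the text for mean power density 4 W/mm^2."""
    dt_mode1 = 6.65      # degrees C, fixed-outline (Mode-I)
    dt_mode2 = 3.5       # degrees C, left-compacted boundary (Mode-II)
    dt_low = 0.85        # degrees C, Mode-I at 0.5 W/mm^2
    total_ws = 20.0      # percent whitespace in the initial floor plan
    internal_ws = 16.21  # percent internal whitespace

    external_ws = total_ws - internal_ws
    dt_external = dt_mode1 - dt_mode2
    return {
        "dt_scaling": dt_mode1 / dt_low,               # growth in reduction for 8x power
        "external_ws": external_ws,                    # percent
        "dt_external": dt_external,                    # degrees C attributed to EW
        "iw_per_percent": dt_mode2 / internal_ws,      # degrees C per percent IW
        "ew_per_percent": dt_external / external_ws,   # degrees C per percent EW
    }
```

On these figures, each percent of external whitespace accounts for roughly four times the temperature reduction of each percent of internal whitespace, which matches the qualitative conclusion in the text.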
Figure 8 Resistor network for thermal simulation
Table 3 shows the results on the ami49 benchmark circuit. The initial floor plan had 20% whitespace and unit aspect ratio. [...] which is much less than the corresponding ΔT value in Mode-I, although the difference in whitespace availability is only 4.63%. We conclude that a small percentage of EW is capable of a considerable reduction in T. Thus, the trend towards hierarchical design using fixed-outline floor planning can actually benefit from this method. The average increase in HPWL was 2.34% for fixed-outline floor planning and 2.29% for conventional floor planning of ami49. τ indicates the ratio of ΔT to ΔHPWL due to whitespace reallocation. In all cases, the increase in HPWL was 2-3%, which is negligible unless it affects the critical nets. If the initial floor plan has critical nets that cannot afford even a minor net-length increment, the CPSO procedure can still be carried out. In that case, all blocks connected to the critical net are pinned during the particle position and velocity updates.
The runtimes on a 1.8 GHz P4 machine with 1 GB RAM are reported in Table 1. The runtime for ami49 is smaller than that for ami33 because the CPSO procedure saturated in fewer iterations and was terminated early. Fig. 9 shows thermal plots of ami49 before and after whitespace reallocation. The hottest spot in the left-compacted floor plan (87.35°C) was at the centre of module M004. After whitespace reallocation, the hottest spot shifted to the centre of module M001 and the new hotspot temperature was reduced to 83.11°C.
Conclusion
In this paper, we present a robust method for thermal optimisation at a post-floor-planning stage. The method is capable of reducing the maximum on-chip temperature of floor plans that have already been optimised for temperature during floor planning. The method could also be applied to the optimal placement of processors in a multi-core chip. Whitespace reallocation can reduce the chip temperature by a few degrees to several tens of degrees depending on the availability of module slack, EW, power-density distribution and mode of floor planning. The efficacy of whitespace reallocation is expected to become more pronounced as device sizes shrink, power density escalates and design complexity pushes floor planning to follow a hierarchical fixed-outline methodology.
Acknowledgment
The authors thank Dr Igor Markov, University of Michigan and Dr Subhashish Mitra, Stanford University for their helpful comments and suggestions.
References
