Abstract-Three-dimensional (3-D) packaging via system-on-package (SOP) is a viable alternative to system-on-chip (SOC) to meet the rigorous requirements of today's mixed signal system integration. In this article, we present the first physical design algorithms for thermal and power supply noise-aware 3-D placement and crosstalk-aware 3-D global routing. Existing approaches consider the thermal distribution, power supply noise, and crosstalk issues as an afterthought, which may require an expensive cooling scheme, more decoupling capacitors (decap), and additional routing layers. Our goal is to overcome this problem with our thermal/decap/crosstalk-aware 3-D layout automation tools. The traditional design objectives such as performance, area, wirelength, and via count are considered simultaneously to ensure high quality results. The related experimental results demonstrate the effectiveness of our approaches.
I. INTRODUCTION

The semiconductor industry is beginning to question the viability of the system-on-chip (SOC) approach due to its low-yield and high-cost problems. Recently, three-dimensional (3-D) packaging via system-on-package (SOP) [1]-[3] has been proposed as an alternative solution to meet the rigorous requirements of today's mixed signal system integration. The SOP is about 3-D integration of multiple functions in a miniaturized package achieved by thin film embedding. The 3-D SOP concept optimizes integrated circuits (ICs) for transistors and the package for integration of digital, RF, optical, sensor, and other functions. It accomplishes this by both build-up SOP, similar to IC fabrication, and by stacked SOP, similar to parallel board fabrication. The uniqueness of 3-D SOP is in the highly integrated or embedded RF, optical, or digital functional blocks and sensors, in contrast to stacked ICs and stacked packages. Due to the high complexity in designing large-scale SOP under multiple objectives and constraints, computer-aided design (CAD) tools have become indispensable. Thus, innovative ideas on CAD tools for SOP design are needed.

Manuscript received November 26, 2004; revised July 30, 2005. This work was supported by grants from MARCO C2S2 and NSF CAREER under CCF-0546382. This work was recommended for publication by Associate Editor C. Amon upon evaluation of the reviewers' comments.
The authors are with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: limsk@ece.gatech.edu).
This article tackles the three most crucial issues that threaten the reliability of the SOP paradigm: thermal distribution, power supply noise, and crosstalk. First, thermal issues can no longer be ignored in high performance 3-D packages due to higher power densities and other issues. High temperatures not only require more advanced heat sinks, they also degrade circuit performance. Interconnect delay increases with temperature, which degrades circuit timing. If timing deteriorates enough, logic faults can occur. Hence thermal issues must be considered early on in the design process. Second, the continuing trend of reducing power supply voltage has resulted in reduced noise margin, which affects reliability and may even cause functional failures due to spurious transitions. Active devices in 3-D packaging draw a large volume of instantaneous current during switching, which causes simultaneous switching noise (SSN). Third, due to the scaling down of device geometry in deep-submicron technologies, the crosstalk noise between adjacent nets has become a major concern in high performance packaging design. Increased coupling noise can cause signal delays, logic hazards, and even malfunctioning of the designs. Thus controlling the level of crosstalk noise in 3-D packages is an important task for the designers.
However, existing approaches consider these issues as an afterthought, which may require expensive cooling schemes, more decoupling capacitors (decap), and more routing layers. In addition, many time-consuming iterations are required between full-length thermal/SSN/crosstalk simulation and manual layout repair until convergence to a satisfactory result. We note that the placement of modules in 3-D SOP design has a huge impact on thermal distribution and SSN, while the routing of the signal nets has a direct impact on crosstalk. Therefore, the goal of this article is to present the first physical layout algorithm for 3-D SOP that combines thermal/decap-aware 3-D placement and crosstalk-aware 3-D global routing. The traditional design objectives such as performance, area, wirelength, and via costs are considered simultaneously to ensure high quality results. The related experimental results demonstrate the effectiveness of our approach.
The remainder of this paper is organized as follows. Section II discusses previous works. Section III presents the problem formulation and an overview of our 3-D algorithms. Section IV presents the SOP placement algorithm. Section V presents the SOP routing algorithm. Section VI presents the experimental results. We conclude in Section VII.
II. RELATED WORKS
The physical design for 3-D integration has drawn interest from both academia and industry recently. Unlike the active physical design research effort in mixed signal SOC [4], [5] and 3-D stacked die technology [6]-[8], however, the research for 3-D SOP lags considerably behind due to its short history. The physical design research for SOP has been recently pioneered in [9]-[14]. Decoupling capacitor placement is done for two-dimensional (2-D) printed circuit board (PCB) designs [15]-[17]. Thermal-aware module placement algorithms are proposed for PCB designs [18], [19]. Many multichip module (MCM) routing algorithms have been proposed in the literature [20]-[25]. Several works on MCM pin redistribution include [26]-[28].
There exist several similarities and differences between MCM and SOP routing. In general, the design objectives and constraints in both MCM and SOP routing are quite similar, where performance, wirelength, via, layer, and crosstalk optimization are crucial. In addition, the routing process is divided into multiple steps: pin distribution, topology generation, layer assignment, and detailed routing. A notable difference between SOP and MCM routing lies in the fact that there exist multiple device layers in SOP, whereas in MCMs there is only one device layer. Therefore, nets are now connecting pins located in all intermediate layers in SOP, and the blocks in each layer behave as obstacles. In MCMs, however, all pins are located only at the top layer, and there exist no obstacles except for the wires themselves. This makes the SOP or 3-D package routing problem more general than MCM routing.
In SOP designs, some pins in each device layer can be distributed either to the layer above or below. In addition, some nets that connect blocks in nonneighboring device layers need to penetrate intermediate device layers. Since the blocks in these intermediate layers become obstacles, designers use routing channels in each device layer to insert "feed-through vias." Thus, we have to determine which routing channel to use for feed-through vias. These routing channels are also used for intra-layer connection. Thus, after feed-through via insertion is done, we need to decide which remaining channels to use for the intra-layer connection. We then need to finish connection between the original and distributed pins. In case the pin location is not known for some "soft blocks," we need to determine their location either along the boundary or underneath the block during our local routing step.
III. PROBLEM FORMULATION
A. SOP Layer Structure
The layer structure in multilayer SOP is illustrated in Figs. 1 and 2. The placement layers contain the blocks (such as ICs, embedded passives, opto-electric components, etc.), which from the point of view of physical design are just rectangular blocks with pins along the boundary. The interval between two adjacent placement layers is called the routing interval. A routing interval contains a stack of routing layers sandwiched between pin distribution layers. These layers are actually $x$-$y$ routing layer pairs so that the rectilinear partial net topologies may be assigned to them. The pin distribution layers in each routing interval are used to evenly distribute pins from the nets that are assigned to this interval. Then these evenly distributed pins are connected using the routing layer pairs. Each placement layer consists of a pair of $x$-$y$ routing layers, so routing is permitted. A feed-through via is used to connect two pin distribution layers from different routing intervals. Thus, the routing channels in each placement layer are used for two purposes: i) to accommodate feed-through vias and ii) to perform local routing, where a limited number of intra-layer connections are made.

[Figure caption: The black and white dots, respectively, denote the original and redistributed pins. The "x" denotes a feed-through pin for an x-net to pass through a placement layer using a routing channel. The solid, dotted, and arrowed lines, respectively, denote signal wires, vias, and feed-through vias.]
In the SOP model the nets are classified into two categories. The nets which have all their terminals in the same placement layer are called i-nets, while the ones having terminals in different placement layers are x-nets. The i-nets can be routed in a single routing interval or indeed within the placement layer itself. On the other hand, the x-nets may span more than one routing interval.
We use the floor connection graph (FCG) [29], illustrated in Fig. 3, to model the placement layer, where each routing channel becomes an edge and each channel intersection point becomes a node. In addition, each soft block becomes a node, and pin assignment edges are added to this node to connect to all adjacent channels. This model allows us to determine which boundary to use for the pins with unknown location. Each channel edge is associated with i) via capacity for feed-through vias and ii) wire capacity for local routing. In addition, pin assignment edges have pin capacity for each boundary of a soft block. The routing and pin distribution layers in each routing interval are modeled with a standard 3-D grid, where each node represents a routing region and each $x/y$-direction edge represents a horizontal/vertical boundary among the regions. Each $z$-direction edge represents the group of vias each region can accommodate. Thus, all edges in this 3-D grid are associated with a capacity: $x$ and $y$ edges with wire capacity, and $z$ edges with via capacity.
B. SOP Placement Problem
The following are given as the input to our 3-D SOP placement problem: i) a set of blocks that represent the various active and passive components in the given SOP design, ii) width, height, and maximum switching currents for each block, iii) a netlist that specifies how the blocks are connected via electrical wires, iv) $k$, the number of placement layers in the 3-D packaging structure, and v) the number of power/ground signal layers along with the location of the power/ground pins.
For each net $n$ from a given netlist $NL$, let $wl(n)$ denote the wirelength of $n$. The wirelength is the sum of the Manhattan distances in the $x$, $y$, and $z$ directions, where the $z$ direction is the height of the associated vias. Let $W_i$, $H_i$, and $A_i$, respectively, denote the width, height, and area of placement layer $i$. Let $A_{max} = \max_i W_i \times \max_i H_i$, i.e., the product of the maximum width and the maximum height among all placement layers. Let $A_{tot} = \sum_i A_i$. Let $\bar{W}$ and $\bar{H}$ denote the average width and height among all placement layers. Let $A_{dev} = \sum_i (|\bar{W} - W_i| + |\bar{H} - H_i|)$, i.e., the sum of the deviation of the dimension among all placement layers. Last, the area objective of 3-D SOP placement, denoted $A_{3D}$, is the weighted sum of $A_{max}$, $A_{tot}$, and $A_{dev}$. Let $T_{max}$ denote the maximum temperature among all blocks. Let $D_{cap}$ denote the total amount of decoupling capacitance required to suppress the simultaneous switching noise (SSN) under the given tolerance value. Last, the goal of the SOP Placement Problem is to find the location of each block in the placement layers such that the following cost function is minimized while the SSN constraint is satisfied:

$$\alpha \cdot A_{3D} + \beta \cdot \sum_{n \in NL} wl(n) + \gamma \cdot D_{cap} + \delta \cdot T_{max} \tag{1}$$

where $\alpha$, $\beta$, $\gamma$, and $\delta$ are weighting constants.
In addition, decaps are required to be placed adjacent to the blocks that require them.
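As a rough illustration of how the placement objective could be evaluated for one candidate solution, the following Python sketch computes the area objective and the overall weighted cost. The names `PlacementLayer`, `area_objective`, and `placement_cost`, as well as all weight values, are hypothetical. The second area term is assumed here to be the total area of all layers, and the cost is assumed to be a plain weighted sum of area, wirelength, decap, and maximum temperature.

```python
from dataclasses import dataclass


@dataclass
class PlacementLayer:
    width: float
    height: float


def area_objective(layers, w_max=1.0, w_tot=1.0, w_dev=1.0):
    """Weighted sum of bounding footprint, total area, and dimension deviation."""
    # Footprint: maximum width times maximum height over all layers.
    a_max = max(l.width for l in layers) * max(l.height for l in layers)
    # Assumed second term: total area of all placement layers.
    a_tot = sum(l.width * l.height for l in layers)
    avg_w = sum(l.width for l in layers) / len(layers)
    avg_h = sum(l.height for l in layers) / len(layers)
    # Deviation of each layer's dimensions from the average dimensions.
    a_dev = sum(abs(l.width - avg_w) + abs(l.height - avg_h) for l in layers)
    return w_max * a_max + w_tot * a_tot + w_dev * a_dev


def placement_cost(layers, wirelengths, decap, t_max,
                   weights=(1.0, 1.0, 1.0, 1.0)):
    """Assumed weighted-sum form of the placement cost."""
    a, b, c, d = weights
    return (a * area_objective(layers) + b * sum(wirelengths)
            + c * decap + d * t_max)
```

Two layers with swapped dimensions, such as 4 x 2 and 2 x 4, are penalized by both the footprint term and the deviation term, which is the behavior the area objective is meant to capture.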
C. SOP Routing Problem
For each net $n$ from a given netlist $NL$, let $xt(n)$ and $via(n)$, respectively, denote the amount of crosstalk and via associated with $n$. Let $cl(n, m)$ denote the coupling length between nets $n$ and $m$ as illustrated in Fig. 4. We define $xt(n)$ as follows:

$$xt(n) = \sum_{m \in NL,\; m \neq n,\; l(m) = l(n)} cl(n, m)$$

where $l(n)$ denotes the routing layer that contains net $n$. For each net $n$, let $d(n, s)$ denote the Elmore delay [30] at sink $s$. Then, the maximum sink delay of net $n$, denoted $D(n)$, is $\max_s d(n, s)$. The performance objective of SOP global routing is to minimize the following cost function:

$$D_{max} = \max_{n \in NL} D(n)$$

For each routing interval $RI_i$ in a 3-D package with $k$ placement layers, let $T_i$, $R_i$, and $B_i$, respectively, denote the top pin distribution layer pairs, routing layer pairs, and bottom pin distribution layer pairs. There exist three kinds of connections in each routing interval $RI_i$: top (connection between $PL_i$ and $T_i$), middle (connection between $T_i$ and $B_i$), and bottom (connection between $B_i$ and $PL_{i+1}$), where $PL_i$ and $PL_{i+1}$ denote the placement layers adjacent to $RI_i$. We use the routing layer pairs in $T_i$, $R_i$, and $B_i$, respectively, for the top, middle, and bottom connections. We construct the routing grid, denoted $G_i$, that contains all the distributed pins from $T_i$ and $B_i$ in a 2-D grid. These pins are connected by a graph-based Steiner tree topology generation algorithm (we use an existing heuristic [31] for this purpose). The total number of layers used in an SOP global routing solution is given by

$$L_{tot} = \sum_{i} \left( |T_i| + |R_i| + |B_i| \right)$$

Section V-D discusses how to compute $|R_i|$ and Section V-E discusses how to compute $|T_i|$ and $|B_i|$. Last, the formal definition of the SOP Global Routing Problem is as follows: Given a 3-D placement and netlist $NL$, generate a routing topology for each net $n$, assign it to a set of routing layers, and assign all pins of $n$ to legal locations. All conflicting nets are assigned to different routing layers while satisfying various wire/via capacity constraints. The objective is to minimize the following cost function:

$$\alpha \cdot L_{tot} + \beta \cdot D_{max} + \gamma \cdot \sum_{n \in NL} xt(n) + \delta \cdot \sum_{n \in NL} wl(n) + \epsilon \cdot \sum_{n \in NL} via(n)$$

where $wl(n)$ denotes the wirelength of net $n$ and $\alpha$ through $\epsilon$ are weighting constants.

We minimize $|R_i|$ during our layer assignment step and $|T_i|$ and $|B_i|$ during our channel assignment step. $D_{max}$ is the focus during our topology generation step. The wirelength, via, and crosstalk minimization are addressed in all steps of our global router.
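The crosstalk measure described above, in which a net accumulates the coupling length it shares with other nets on the same routing layer, can be sketched in a few lines of Python. The function name `crosstalk` and the dictionary-based inputs are hypothetical; coupling lengths are assumed to be precomputed for unordered net pairs.

```python
def crosstalk(net, layer_of, coupling):
    """Sum the coupling length between `net` and every other net on its layer.

    layer_of: maps a net name to the routing layer it is assigned to.
    coupling: maps an unordered pair (a, b) of net names to a coupling length.
    """
    total = 0.0
    for (a, b), length in coupling.items():
        if net not in (a, b) or a == b:
            continue
        other = b if a == net else a
        # Only nets that share the routing layer contribute to crosstalk.
        if layer_of[other] == layer_of[net]:
            total += length
    return total
```

Moving one net of a coupled pair to a different layer removes that pair's contribution, which is why layer assignment directly affects this term.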
D. Overview of the Approach
An overview of our 3-D SOP physical design tool is shown in Fig. 5. The input to the flow is a module-level netlist that specifies how the SOP modules such as digital die, analog die, embedded passives, opto/MEMS modules, etc., are connected. Our 3-D module placement is based on the Simulated Annealing approach [32], where we examine many candidate 3-D module placement solutions and evaluate each of them using our thermal and power supply noise analyzers. More details are presented in Section IV. Our 3-D global routing is decomposed into five steps, where the input/output (IO) pins from the blocks are first evenly distributed using pin distribution layers to alleviate the congestion problem. We then construct a routing tree topology for all nets and assign them to the routing layers to minimize crosstalk. We place the feed-through vias for x-nets and finish the connections in each placement layer. More details are presented in Section V.
IV. SOP PLACEMENT ALGORITHM
A. Overview of the Algorithm
Simulated Annealing is a very popular approach for module placement due to its high quality solutions and flexibility in handling various constraints. We extend the existing 2-D Sequence Pair scheme [33] to represent our 3-D module placement solutions. The Simulated Annealing procedure starts with an initial multilayer placement along with its cost in terms of area, wirelength, decap, and maximum temperature. We then make a random perturbation (move) to the initial solution to generate a new 3-D placement solution and measure its cost. We perform 3-D SSN analysis to compute the amount of decap needed. The algorithm does a one-time setup of the thermal matrices. These matrices are used during incremental temperature calculations to evaluate the thermal cost. The thermal modeling and evaluation are explained in Section IV-B. White space insertion (discussed in Section IV-D) is done and the increase in footprint area is estimated. If the new cost is lower than the old one, the solution is accepted; otherwise the new solution is accepted with a probability that depends on the temperature of the annealing schedule. We examine a predetermined number of candidate solutions at each temperature. The temperature is decreased exponentially, and the annealing process terminates when the freezing temperature is reached. The actual decap allocation is done after optimization using the LP solver discussed in Section IV-D.
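The annealing loop described above can be sketched generically as follows. This is a minimal illustration, not the placer itself: the function `anneal`, its parameter names, and the default schedule values are hypothetical, the acceptance rule is assumed to be the standard Metropolis criterion, and the sketch additionally remembers the best solution seen. The `cost` and `perturb` callbacks stand in for the 3-D cost evaluation and the sequence-pair moves.

```python
import math
import random


def anneal(initial, cost, perturb, t_start=100.0, t_freeze=0.1,
           cooling=0.9, moves_per_temp=50, rng=None):
    """Generic simulated annealing with an exponential cooling schedule."""
    rng = rng or random.Random(0)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    t = t_start
    while t > t_freeze:
        # Examine a fixed number of candidate solutions per temperature.
        for _ in range(moves_per_temp):
            candidate = perturb(current, rng)
            candidate_cost = cost(candidate)
            delta = candidate_cost - current_cost
            # Always accept improvements; accept worse moves with a
            # temperature-dependent probability (assumed Metropolis rule).
            if delta < 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= cooling  # exponential temperature decrease
    return best, best_cost
```

In a placer, `cost` would combine area, wirelength, decap, and maximum temperature, so its evaluation time dominates the runtime of the loop.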
B. 3-D Thermal Analysis
We use a 3-D thermal resistance mesh, as shown in Fig. 6, for thermal analysis. Each node models a small volume of the 3-D SOP substrate, and each edge denotes the connectivity between two adjacent regions. This is equivalent to using a discrete approximation of the steady-state thermal equation $k \nabla^2 T + P = 0$, where $k$ is thermal conductivity, $T$ is temperature, and $P$ is power. This results in the matrix equation $T = R \cdot P$, where $R$ is the thermal resistance matrix. Since the matrix computation involved with thermal analysis for each candidate 3-D placement solution is expensive, we tackle this problem with the following scheme. Assuming that the thermal conductivities of the modules are similar, swapping the module locations would not change the thermal resistance matrix too much. This means that matrix $R$ only needs to be computed once in the beginning. To calculate the temperature profile of a new block configuration, the power profile $P$ needs to be updated and then multiplied by $R$. Alternatively, a change in power profile $\Delta P$ can be defined. Multiplying $R$ and $\Delta P$ will give the change in temperature profile $\Delta T$. Adding $\Delta T$ to the old temperature profile will give the new temperature profile. These equations summarize the two methods:

$$T_{new} = R \cdot P_{new}$$

$$T_{new} = T_{old} + R \cdot \Delta P, \quad \Delta P = P_{new} - P_{old}$$

Swapping two blocks usually has a small effect on the power profile, so $\Delta P$ should be sparse. This reduces the number of multiplications used by the second method at the expense of doing extra additions and subtractions.
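The two update methods can be compared with a small Python sketch. It assumes the temperature profile is obtained by multiplying a fixed thermal resistance matrix with the power profile; the function names and the plain list-of-lists matrix representation are hypothetical choices made to keep the example dependency-free.

```python
def full_update(R, p_new):
    """Method 1: recompute the whole temperature profile from the new power."""
    return [sum(r * p for r, p in zip(row, p_new)) for row in R]


def incremental_update(R, t_old, p_old, p_new):
    """Method 2: add the temperature change caused by the power change.

    Only columns of R whose power entry changed are touched, so the work is
    proportional to the number of nonzero entries in the power difference.
    """
    t_new = list(t_old)
    for j, (old, new) in enumerate(zip(p_old, p_new)):
        dp = new - old
        if dp != 0.0:
            for i, row in enumerate(R):
                t_new[i] += row[j] * dp
    return t_new
```

When a move swaps two blocks, only the mesh nodes under those blocks change power, so the incremental version skips most columns of the matrix.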
C. 3-D Power Supply Noise Modeling
We model the P/G network for 3-D SOP as a 3-D grid graph as shown in Fig. 7. The edges in the grid graph have inductive and resistive impedances. The mesh contains power-supply points and connection points, which supply and consume currents. The current distribution for the blocks is inversely proportional to the impedance of the path from the source.
The dominant current source for a block is defined as the voltage source supplying significantly more power to the block than any other neighboring sources. The dominant path for a block is the path from the dominant supply to the block causing the most drop in voltage. It has been shown experimentally in [34] that the shortest path between the dominant current source (nearest Vdd pins) and the block offers highly accurate SSN estimation within reasonable runtime. In our 3-D SSN analysis engine, we compute dominant paths for all blocks, which are dynamically updated whenever a new placement solution is evaluated in terms of SSN. Let $P_k$ be a dominant current path for block $b_k$. Then $N(P_k)$ denotes the set of dominating paths overlapping with $P_k$ ($N(P_k)$ includes $P_k$ itself). Let $P_{jk}$ be the overlapping segments between path $P_j$ and $P_k$. Let $R_{P_{jk}}$ and $L_{P_{jk}}$ denote the resistance and inductance of $P_{jk}$. After the current paths and their values have been determined for all blocks, the SSN for $b_k$ is given by

$$V_{noise}^{(k)} = \sum_{P_j \in N(P_k)} \left( i_j \cdot R_{P_{jk}} + L_{P_{jk}} \cdot \frac{d i_j}{dt} \right)$$

where $i_j$ is the current in the path $P_j$, which is the sum of all currents through this path to various consumers. An illustration is shown in Fig. 8. The weights of $i_j$ and its rate of change are the resistive and inductive components of the path. Let $Q^{(k)}$ denote the maximum charge drawn from the power supply by block $b_k$. If $V_{noise}^{(k)} > V_{noise}^{(lim)}$, where $V_{noise}^{(lim)}$ is the noise tolerance, the decap allocated to block $b_k$ is given by

$$C^{(k)} = \left( 1 - \frac{V_{noise}^{(lim)}}{V_{noise}^{(k)}} \right) \cdot \frac{Q^{(k)}}{V_{noise}^{(lim)}}, \quad k = 1, \ldots, M$$

where $M$ denotes the total number of blocks to be placed. Finally, the decap cost is given by $\sum_{k=1}^{M} C^{(k)}$.

[Figure caption: Block C draws current from two sources through three paths, each carrying one third of its current. The resistance of the segment where two of these paths overlap contributes to the SSN at both B and C.]
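A small numerical sketch may help make the noise and decap budget computation concrete. It assumes that the SSN at a block is the sum of resistive and inductive voltage drops over the path segments that overlap its dominant path, and that the decap budget scales the block's charge demand by the fraction of noise exceeding the tolerance, in the style of [34]. The function names and the tuple layout are hypothetical.

```python
def ssn(overlaps):
    """Noise at a block from overlapping dominant paths.

    overlaps: iterable of (current, di_dt, resistance, inductance) tuples,
    one per path that shares a segment with the block's dominant path.
    """
    return sum(i * r + l * di_dt for i, di_dt, r, l in overlaps)


def decap_budget(v_noise, v_limit, charge):
    """Decap needed to bring the block's noise down to the tolerance."""
    if v_noise <= v_limit:
        return 0.0  # already within tolerance, no decap required
    theta = v_limit / v_noise
    return (1.0 - theta) * charge / v_limit


def decap_cost(blocks, v_limit):
    """Total decap over all blocks; blocks is a list of (overlaps, charge)."""
    return sum(decap_budget(ssn(o), v_limit, q) for o, q in blocks)
```

Because the budget depends on which paths overlap, moving a block closer to a supply pin or away from a heavily shared path lowers its decap requirement.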
D. 3-D Decoupling Capacitor Placement
In our LP-based 3-D decap allocation shown in Fig. 9, $x_{ij}$ denotes the amount of decap allocated from white space $w_i$ to block $b_j$. The objective is to maximize the utilization of white spaces for decap allocation. The first constraint limits the total allocation of a white space to its total area, where $A(w_i)$ denotes the area of white space $w_i$, and $N(w_i)$ denotes the neighboring blocks of $w_i$. The second constraint ensures that the total amount of decap allocated to each block does not exceed its demand, where $D(b_j)$ denotes the demand of block $b_j$. Unlike the 2-D decap assignment done in [34], the neighboring blocks in the 3-D case are the adjacent blocks either from the same placement layer or from neighboring layers. The assumption here is that the white space from different layers can be used to allocate decap. In order to facilitate this, we introduce predetermined parameters $\alpha_{ij}$ to control decap allocation to block $b_j$ from white space $w_i$. The parameter $\alpha_{ij}$ evaluates the usefulness of white space $w_i$ as a decap for block $b_j$. The value of $\alpha_{ij}$ decreases with increasing distance of the white space from the module that requires decap. Hence allocations of decap from distant white spaces are avoided.
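One plausible way to write the allocation LP described above is given below. This is a sketch under stated assumptions rather than the formulation of Fig. 9: the symbols $x_{ij}$ (decap allocated from white space $w_i$ to block $b_j$), $\alpha_{ij}$ (distance-dependent usefulness parameter), $A(w_i)$ (white space area), $D(b_j)$ (decap demand), and $N(w_i)$ (blocks neighboring $w_i$) are notation chosen for illustration, and white space area is assumed to be expressed in the same units as decap.

```latex
\begin{align*}
\text{maximize}\quad
  & \sum_{i} \sum_{b_j \in N(w_i)} \alpha_{ij}\, x_{ij} \\
\text{subject to}\quad
  & \sum_{b_j \in N(w_i)} x_{ij} \le A(w_i)
    && \text{for each white space } w_i, \\
  & \sum_{i \,:\, b_j \in N(w_i)} x_{ij} \le D(b_j)
    && \text{for each block } b_j, \\
  & x_{ij} \ge 0
    && \text{for all } i, j .
\end{align*}
```

Weighting each allocation by $\alpha_{ij}$ makes the solver prefer nearby white space, since a unit of decap taken from a distant white space contributes less to the objective.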
The second aspect of the formulation deals with the generation of the optimal white space area that will accommodate all decap without too much penalty on the final footprint area. The decap budget for each block is determined through SSN analysis of a particular 3-D placement. The decap allocation involves several iterations between white space insertion and LP solving to reach a good solution both in terms of completeness of decap allocated and the additional increase in area. The white space is generated by expanding the floorplan in the $x$ and $y$ directions as illustrated in Fig. 10. In a multilayer placement, however, we have to take additional care to balance the expansion in each layer, so as to minimize the expansion of the total footprint area. The expansion is proportional to the decap demand of each module. In our sequence pair-based 3-D placement, we modify the horizontal and vertical constraint graphs to expand the placement into the $x$ and $y$ directions, respectively.
Note that we may have to iterate between white space insertion and allocation if the current expansion does not satisfy all the decap needs. To avoid iterating between LP and white space insertion, we start by adding white spaces generously in the floorplan and then use LP to perform compaction. In our scheme, the expansion is controlled by the predetermined allocation parameters discussed earlier, where the existing white spaces have higher ranks than the newly inserted ones. The actual amount of expansion is determined by a parameter $\beta \geq 1$, where we increase the amount of expansion by a factor of $\beta$ for each block that needs decap. This ensures that the initial expansion satisfies the LP constraints. Since LP determines the individual allocations, we use these assignment values to decide which white space insertions were not necessary and thus can be removed.
As exact decap allocation is a time-consuming process, it is impractical to solve an LP problem for every candidate solution. However, our white space insertion is computationally equivalent to computing block locations using the longest path computation for area. Therefore, we can afford white space insertion (and the calculation of area increase due to decap insertion) at every move. We perform the actual LP-based decap allocation at the end of the annealing process as a compaction stage.
V. SOP GLOBAL ROUTING
A. Overview of the Approach
Our 3-D global router is divided into the following five steps.
1) Pin redistribution: we first determine which set of i-nets and x-net segments are assigned to each routing interval. The pins from these nets are then evenly distributed in the top and bottom pin distribution layers. Our pin redistribution step is further divided into three steps: coarse pin distribution, net distribution, and detailed pin distribution.
2) Topology generation: Steiner trees are generated for all nets in each routing interval so that the performance of the routed design is optimized.
3) Layer assignment: the routed nets are assigned to a unique routing pair in the routing layer so that the total number of layers used is minimized.
4) Channel assignment: for each x-net, the location of its feed-through via in the routing channel is determined. We also assign channels and finish the connections for the i-nets that are to be routed in each placement layer.
5) Local routing: we finish connections between the pins now located in the routing channels and the pins on the block boundaries. We also determine the location of pins from soft blocks.
We use an existing max-cut partitioning heuristic [35] for the net distribution in step 1. For step 2, we use an existing RSA/G heuristic [31] to generate the net topologies. In addition, we use the congestion-driven rip-up-and-reroute [36] for step 5. Therefore, the focus of this paper is to develop heuristics for the remaining steps in 1, 3, and 4.
B. Coarse Pin Distribution
A pseudocode for our coarse pin distribution algorithm is shown in Fig. 11. First, we assign all pins in the placement layers to a nearby grid point in CP, a 2-D grid, while trying to balance the number of pins assigned to each grid point (line 1-4). An illustration is shown in Fig. 12. We impose a pin capacity for each grid point so that the pins are evenly distributed in CP. Our approach is to visit the pins in a random order and find the best grid point for each pin. For each pin $p$, a grid point that is closest to the original location of $p$ and has not violated the pin capacity constraint is chosen (line 3). After this process is finished, CP serves as the starting point of our min-cut placement-based algorithm, where each grid point corresponds to a partition. We then iteratively improve the quality of this initial solution via a move-based approach. In our new heuristic algorithm, our cost function is based on i) how far the new pin location is from the initial location, ii) total wirelength, and iii) how evenly distributed the inter-partition connections are.

Fig. 12. Illustration of coarse pin distribution. Pins along the external boundary are not shown for simplicity.

Fig. 13. Illustration of the gain computation for coarse pin distribution. A net $n$ indicated by the black node is moved from one partition to another. The resulting gains are $-2$, $0$, and $-1$, and the partition with degree 3 becomes the maximum-degree partition.
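The initial assignment step, in which pins are visited in random order and each is sent to the closest grid point that still has capacity, can be sketched as follows. The function name, the dictionary inputs, and the use of Manhattan distance are assumptions made for illustration.

```python
import random


def coarse_pin_assignment(pins, grid_points, capacity, seed=0):
    """Assign each pin to the nearest grid point that is not yet full.

    pins: maps a pin name to its original (x, y) location.
    grid_points: list of (x, y) grid point coordinates in the 2-D grid.
    capacity: maximum number of pins allowed per grid point.
    """
    rng = random.Random(seed)
    order = list(pins)
    rng.shuffle(order)  # visit the pins in a random order
    load = {g: 0 for g in grid_points}
    assignment = {}
    for name in order:
        px, py = pins[name]
        free = [g for g in grid_points if load[g] < capacity]
        if not free:
            raise ValueError("pin capacity too small for the number of pins")
        # Closest grid point (Manhattan distance assumed) with spare capacity.
        best = min(free, key=lambda g: abs(g[0] - px) + abs(g[1] - py))
        assignment[name] = best
        load[best] += 1
    return assignment
```

The result depends on the visiting order whenever a grid point fills up, which is one reason a subsequent move-based refinement is useful.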
For each pin $p$, we define the displacement gain, denoted $g_d(p)$, to represent how much the distance between the original and new location is reduced if $p$ is moved to another partition. We define the wirelength gain, denoted $g_w(p)$, to represent how much the length of the nets that contain $p$ (estimated by the half-perimeter of the bounding box) is reduced if $p$ is moved to another partition. For a partition $P$, let $deg(P)$ denote the number of nets that have connections to $P$. Then, the cutsize balance factor is defined to be the difference between the maximum and minimum $deg(P)$ among all partitions. We define the balance gain, denoted $g_b(p)$, to represent how much the cutsize balance factor is reduced. Our move-based multilevel min-cut partitioning algorithm performs cell moves based on the combined gain function

$$g(p) = \alpha \cdot g_d(p) + \beta \cdot g_w(p) + \gamma \cdot g_b(p)$$

where $\alpha$, $\beta$, and $\gamma$ are weighting constants. Fig. 13 shows an illustration of the gain computation. In our multilevel approach, we first perform restricted multilevel clustering (line 5) that preserves the initial placement result, where two pins that are initially in different partitions are not clustered together. At each level of the cluster hierarchy from top to bottom (line 8), we compute the combined gain for each cluster and perform cluster moves. In order to compute the displacement and balance gain of a group of pins (a cluster), we add the individual displacement and balance gains of all pins in this cluster. When there is no gain at a certain level, we decompose the clusters into the next lower level and perform refinement. This process continues until we obtain a solution at the bottom level (line 12).
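The balance gain can be illustrated with a short sketch. It assumes that the degree of a partition counts the nets that touch the partition and also span at least one other partition, which is one reading of "nets that have connections to" a partition; all function and variable names are hypothetical.

```python
def partition_degrees(nets, part_of, partitions):
    """Count, per partition, the nets that touch it and cross partitions."""
    deg = {p: 0 for p in partitions}
    for pins in nets.values():
        touched = {part_of[pin] for pin in pins}
        if len(touched) > 1:  # assumed: only inter-partition nets count
            for p in touched:
                deg[p] += 1
    return deg


def balance_factor(nets, part_of, partitions):
    """Difference between the maximum and minimum partition degree."""
    deg = partition_degrees(nets, part_of, partitions)
    return max(deg.values()) - min(deg.values())


def balance_gain(nets, part_of, partitions, pin, target):
    """Reduction of the balance factor if `pin` moves to partition `target`."""
    before = balance_factor(nets, part_of, partitions)
    moved = dict(part_of)
    moved[pin] = target
    return before - balance_factor(nets, moved, partitions)
```

A positive gain means the move evens out the inter-partition connections, while a negative gain means it concentrates them in fewer partitions.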
C. Detailed Pin Distribution
After coarse pin distribution and net distribution are finished, we know which set of nets are assigned to each routing interval as well as their (evenly distributed) entry/exit points in the pin distribution layers. However, the coarse pin distribution is done based on the 2-D grid that merged all placement layers into one. The even pin distribution in this 2-D grid offers good enough reference points for net distribution, but it does not consider even pin distribution in each individual routing interval. In addition, it is also possible that the pin capacity for each routing region in each routing interval may be violated. Therefore, the goal of detailed pin distribution is to address these problems in each routing interval so that the subsequent topology generation and layer assignment truly benefit from this even pin distribution. In addition, we use a grid large enough for each routing interval to legalize pin locations, i.e., each grid point contains only one pin. Since crosstalk minimization is addressed during the prior steps, the major focus of the detailed pin distribution step is on i) how far the new location is from the original location obtained from coarse pin distribution and ii) the total wirelength.
A pseudocode for our detailed pin distribution algorithm is shown in Fig. 14. Our force-directed heuristic algorithm encourages all pins from the same net to be placed closer to the center of mass while minimizing the distance between the old and new pin location. The size of the grids for detailed pin distribution is determined for each routing interval so that each pin can be assigned to a unique grid point (line 1-3). In addition, we project the coarse pin distribution result to this new set of grids (line 5). Note that there still exists overlap among the pins in DP at this point even though DP is usually finer than CP. In order to remove this overlap, we apply an additional force that slightly pulls each pin toward the center of mass. For each pin $p$ in a net $n$, the displacement force (line 8) for the $x$-direction is defined as follows:

$$f_x(p) = \frac{x_c(n) - x_p}{W(n)}$$

where $x_p$ denotes the $x$-coordinate of $p$, $x_c(n)$ denotes the $x$-coordinate of $c(n)$, the center of mass of $n$, and $W(n)$ denotes the width of the bounding box of $n$. We compute $f_y(p)$ using the $y$-coordinates. Note that $|f_x(p)| \leq 1$ and $|f_y(p)| \leq 1$. The vector $(f_x(p), f_y(p))$ is then added to the location of $p$ (line 9). This minor change on the original pin location helps to remove most of the overlap in DP while not increasing the wirelength too much. We then sort the pins based on the lexicographic order of their new locations and assign each pin starting from the top-most row in the left-most column (line 10-11). Due to its simplicity, this deterministic algorithm is quite efficient and effective in reducing the additional wirelength required for pin distribution as well as the total wirelength among all nets, as shown in Section VI.
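The force-directed step can be illustrated with a short sketch. It assumes the displacement force is the offset from the pin to the net's center of mass, normalized by the bounding box dimension, so that each component stays within one grid unit. The legalization shown here, which visits pins in lexicographic order and gives each the nearest free grid point, is a simplified stand-in chosen for illustration, and all names are hypothetical.

```python
def displacement_force(pin, net_pins):
    """Force pulling `pin` toward the center of mass of its net.

    pin: (x, y) location; net_pins: list of (x, y) for all pins of the net.
    """
    xs = [x for x, _ in net_pins]
    ys = [y for _, y in net_pins]
    cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    # Normalizing by the bounding box keeps each component in [-1, 1].
    fx = (cx - pin[0]) / width if width > 0 else 0.0
    fy = (cy - pin[1]) / height if height > 0 else 0.0
    return fx, fy


def legalize(points, grid_points):
    """Give every pin a unique grid point, visiting pins in sorted order."""
    free = set(grid_points)
    result = {}
    for name in sorted(points, key=lambda k: points[k]):
        x, y = points[name]
        # Nearest free grid point; ties are broken by coordinate order.
        best = min(free, key=lambda g: (abs(g[0] - x) + abs(g[1] - y), g))
        free.remove(best)
        result[name] = best
    return result
```

Because the force is bounded, pins that start on the same grid point are nudged apart according to where the rest of their net lies, without moving far from the coarse solution.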
D. SOP Layer Assignment
For each routing interval $RI_i$, the routing grid $G_i$ contains all the (redistributed) pins from the top and bottom pin distribution layers ($T_i$ and $B_i$). We generate a Steiner-tree based routing topology using the RSA/G heuristic [31] to connect these pins. The goal is to minimize the maximum sink delay as discussed in Section III-C. The routing layer in each routing interval consists of several layer pairs, where each pair consists of one layer for horizontal wires and another layer for vertical wires. Thus, we can assign an entire rectilinear routing tree to a routing pair. In addition, two trees that are intersecting can also be assigned to the same routing pair provided that they do not violate the wire capacity of the routing regions involved. An illustration is shown in Fig. 15. The SOP layer assignment problem is to assign each net to a routing layer pair so that the wire capacity constraint is satisfied and the total number of layer pairs used for all routing intervals is minimized. Our approach is to perform layer assignment for each routing interval independently.
For each routing interval, we construct a layer constraint graph (LCG) [37] as follows. Each net in the routing interval has a corresponding node in the LCG. Two nodes have an edge between them if segments of the two corresponding nets share the same edge in the routing grid, i.e., the two nets share the same boundary of a routing region. We then use a node coloring algorithm to assign colors to the nodes in the LCG such that no two nodes sharing an edge are assigned the same color. The total number of layer pairs used in this routing interval is computed from the total number of colors used during node coloring and the wire capacity of a routing region boundary. Last, each node is assigned to a layer pair according to its color. The maximum numbers of wires used among all horizontal edges and among all vertical edges of the routing grid give a lower bound on the number of layers used in the routing interval.

Fig. 16 shows the SOP layer assignment algorithm that includes our coloring heuristic. We first sort all nodes in the LCG in decreasing order of their number of neighbors (line 3) and visit the nodes in this sorted order (line 5). For each node visited, we collect the set of colors used by its neighbors (line 6). If there exists an already-used color that is not included in this set (line 7), we assign this color to the node (line 12). If multiple colors satisfy this condition, we use the lowest one. Otherwise, we introduce a new color and assign it to the node (lines 9-10). Last, we assign a layer pair to each net based on its color (line 13). In spite of its simplicity, this greedy algorithm provides results that are very close to the lower bound on the total number of layers used, as demonstrated in Section VI.
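The greedy coloring heuristic described above can be sketched in Python as follows. The coloring loop follows the prose directly (largest-degree-first order, lowest reusable color, new color otherwise). The mapping from colors to layer pairs is an assumption made for this example: `layer_pairs` supposes that up to `capacity` colors can share one layer pair, so that the number of pairs is the number of colors divided by the capacity, rounded up. The function names and the adjacency-dictionary input are hypothetical.

```python
import math


def color_lcg(adj):
    """Greedy node coloring of a layer constraint graph.

    adj: {node: set of neighboring nodes}. Returns {node: color}, colors from 1.
    """
    # Visit nodes in decreasing order of neighbor count (ties by name).
    order = sorted(adj, key=lambda v: (-len(adj[v]), v))
    color, used = {}, 0
    for v in order:
        taken = {color[u] for u in adj[v] if u in color}
        # Lowest already-used color not taken by any neighbor, if one exists.
        choice = next((k for k in range(1, used + 1) if k not in taken), None)
        if choice is None:
            used += 1  # introduce a new color
            choice = used
        color[v] = choice
    return color


def layer_pairs(color, capacity):
    """ASSUMPTION: up to `capacity` colors are packed into one layer pair."""
    return {v: math.ceil(c / capacity) for v, c in color.items()}
```

With `adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}`, the three mutually adjacent nets receive three distinct colors, while net `d` reuses the lowest color not held by its only neighbor.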
E. SOP Channel Assignment
Our prior topology generation and layer assignment steps focus on the connections among the distributed pins in the pin distribution layers. After these steps are finished, two kinds of connections remain: i) connections between the original and distributed pins and ii) feed-through via insertion. For both types of connections, the routing channels in the placement layers are used. During the SOP channel assignment step, each pin in the pin distribution layer is mapped to a routing channel in the neighboring placement layer. In addition, each pin from an x-net that needs to penetrate a placement layer is also mapped to a routing channel in that placement layer. Last, we generate a routing topology for each pin-to-channel connection and assign it to a routing layer pair in the pin distribution layer. Each placement layer is modeled with the floor connection graph (FCG) [29], as illustrated in Fig. 3. During the SOP local routing step, we use congestion-driven rip-up-and-reroute [36] to i) finish the routing for the given set of two-pin connections while satisfying the wire capacity constraint and ii) decide the location of pins for soft blocks along their boundaries. An illustration is shown in Fig. 17.
For each placement layer, consider the set of pins that need to be mapped to a routing channel in that layer. This set contains pins from the neighboring pin distribution layers. The pins are grouped into two sets: a terminal pin set for the pins that have terminals in the placement layer, and a feed-through pin set for the pairs of pins that require feed-through vias to penetrate the placement layer. Each routing channel in the placement layer is associated with a pin capacity constraint. The goal of the SOP channel assignment problem is to map each pin to a routing channel and finish the pin-to-channel connection while satisfying the pin capacity constraint, i.e., the number of pins mapped to a channel must not exceed its capacity. For each pair of feed-through pins, we map both pins to the same channel. The objective of SOP channel assignment is to minimize a cost function defined over the numbers of layers used to finish these pin-to-channel connections.
Pseudocode for the SOP channel assignment algorithm is shown in Fig. 18. We visit each placement layer and assign the feed-through pins and terminal pins to the channels. In addition, an L-shaped routing topology for each pin-to-channel connection is constructed in a 2-D grid (line 3). The signal delay of feed-through vias is larger than that of other types of vias. Since each channel is under a capacity constraint, it is important to assign the feed-through pins to the nearest channels first. Therefore, our strategy is to perform channel assignment for the feed-through pins first to minimize the delay of x-nets that require feed-through vias. In addition, we give priority to the pins that belong to long nets. Thus, we sort the pins based on wirelength (line 4). Our heuristic assigns pins to channels based on the cost of mapping: for a given pin, we seek the channel with the best mapping cost (lines 6-8).

The mapping cost for a given pin and channel is computed from four terms: room, the number of pins the channel can still accommodate before it violates the capacity constraint; dist, the Manhattan distance between the pin and the channel; bend, the number of bends in the connection between the pin and the channel; and cong, the total number of existing connections along the proposed L-shaped route. We choose the channel with the maximum cost. Note that a channel is represented by a line instead of a point. Thus, dist and bend are based on the shortest connection between the pin and any point on the channel. Upon a pin-to-channel mapping, we update the usage of the channel and of the edges along the route (line 10). After the channel assignment and topology generation for all pin-to-channel connections are finished, we perform layer assignment using our coloring heuristic presented in Section V-D and compute the total number of layers used.
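To make the channel selection step concrete, the sketch below scores each candidate channel for one pin and picks the channel with the maximum cost. The way the four terms are combined is an assumption for illustration only: the example uses a weighted sum that rewards room and penalizes dist, bend, and cong. The `Channel` class, the weights, and the `cong_of` callback (a stand-in for counting existing connections along the L-shaped route) are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    name: str
    x1: int
    y1: int
    x2: int
    y2: int          # axis-aligned line segment (x1, y1)-(x2, y2)
    capacity: int    # pin capacity of the channel
    used: int = 0    # pins already mapped to this channel


def nearest_point(ch, px, py):
    """Closest point on the channel segment to the pin (Manhattan metric)."""
    nx = min(max(px, min(ch.x1, ch.x2)), max(ch.x1, ch.x2))
    ny = min(max(py, min(ch.y1, ch.y2)), max(ch.y1, ch.y2))
    return nx, ny


def mapping_cost(ch, px, py, cong, weights=(1.0, 1.0, 1.0, 1.0)):
    """ASSUMED cost form: w_room*room - w_dist*dist - w_bend*bend - w_cong*cong."""
    nx, ny = nearest_point(ch, px, py)
    room = ch.capacity - ch.used
    dist = abs(px - nx) + abs(py - ny)
    bend = 1 if (px != nx and py != ny) else 0  # L-shape needs one bend
    w_room, w_dist, w_bend, w_cong = weights
    return w_room * room - w_dist * dist - w_bend * bend - w_cong * cong


def assign_pin(channels, px, py, cong_of=lambda ch: 0):
    """Map one pin to the feasible channel with the maximum mapping cost."""
    feasible = [ch for ch in channels if ch.used < ch.capacity]
    if not feasible:
        return None
    best = max(feasible, key=lambda ch: mapping_cost(ch, px, py, cong_of(ch)))
    best.used += 1  # update channel usage after the mapping
    return best
```

Calling `assign_pin` once per pin, in the sorted order described above (feed-through pins first, then by decreasing wirelength), gives the overall greedy flow; once a channel is full it drops out of the candidate list.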
VI. EXPERIMENTAL RESULTS
We implemented our algorithms in C++/STL and ran our experiments on Linux Beowulf clusters. We tested our algorithms with two sets of benchmarks. The first set is from the standard GSRC floorplan circuits [38]. The second set, named the GT benchmark, was synthesized from the IBM circuits [39], where we use our multilevel partitioner [40] to divide the gate-level netlist into multiple blocks. The GSRC benchmarks are small to medium-sized in terms of both the number of blocks and nets. 5 The GT benchmarks contain a medium to large number of blocks with dense connections.
A. SOP Placement Results
The dimension of the area was assumed to be in mm, and the area per unit decap cost was chosen to be 50 mm. The number of placement layers is fixed to four for all experiments. In our experiments, we used the same parameters across the benchmarks. The technology parameters used were 0.01 Ω/µm for wire resistance per unit length, 1 pH/µm for wire inductance per unit length, and 10 fF/µm for wire capacitance per unit length. We report the average runtime for each benchmark measured in seconds. For our thermal analysis, we used a grid size of 10 × 10 × 7 in the x, y, and z directions. The number of active layers was four, with three passive layers in between them. The conductivity of the substrate was chosen to be that of silicon (0.1 W/mm·°C). The conductivity of the sides of the package was fixed at 0.01 W/mm·°C. The heat sink was assumed to be at the top of the package with a conductivity of 0.5 W/mm·°C, and the conductivity of the bottom of the package was chosen to be 0.2 W/mm·°C. The power density of the blocks varied from 10 W/mm^2 to 300 W/mm^2 depending on the switching current demands. The switching current demands were generated from a formula based on a random number and the size of the block.

Table I shows a comparison among 3-D SOP placement algorithms with three different objectives: 1) area/wire-driven, 2) decap-driven, and 3) thermal-driven, each obtained by setting the weights in (1), Section III-B, to 1 or 0. Our baseline algorithm is the area/wirelength-driven algorithm.
• Our decap-driven algorithm achieved a 21% improvement in decap cost over baseline, with a 7% increase in final area and a 15% increase in wirelength. The temperature decreased by 3%.
• Our thermal-driven floorplanning achieved a 21% improvement in temperature over baseline, at the cost of a dramatic 76% increase in total area. Wirelength increased by 28% and the decap requirement increased by 19%. We note that it may be possible to reduce the area and wirelength costs by fine-tuning parameters.

Table I also shows the result of our final 3-D placement algorithm that considers all four objectives: area, wirelength, decap, and thermal (all weights set to 1 in (1), Section III-B). We note that our area/wire/decap/thermal-driven algorithm improves both decap and temperature over baseline by 13% and 9%, respectively, with a reasonable increase in final area and wirelength of 19% and 15%, respectively. We also tested our area/wire/decap algorithm under a thermal constraint. In this case, every candidate solution during simulated annealing is discarded if its maximum temperature is above a certain threshold (100 °C in our case). We observe that the area, wirelength, and decap results improve simultaneously at the cost of a slight increase in temperature. The CPU time of the algorithm depends on the number of candidate solutions evaluated, so the times reported in the table directly reflect the portion of the configuration space explored by the algorithm. The constrained algorithm ran much faster, making a smaller number of moves because fewer moves were accepted.

5 The GT benchmark circuits are available for download at our website: www.gtcad.gatech.edu.

Fig. 19 shows the temperature and decap requirement of each block in the final floorplan of the n300 benchmark. The figure clearly shows that there is little correlation between temperature and decap. Thus, minimizing one objective does not necessarily minimize the other, as a positive correlation would indicate. It also means that the two objectives do not conflict with each other, as a negative correlation would indicate. This matches the block placement results. Table II shows the correlation constants.
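The relationship between per-block temperature and decap discussed above can be checked with a correlation coefficient. The sketch below assumes the Pearson coefficient, which may differ from the measure behind the reported correlation constants; the function name `pearson` is hypothetical, and any input values used with it here are illustrative rather than measured data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

A value near +1 would mean that lowering one quantity tends to lower the other, a value near -1 would mean the two objectives pull against each other, and a value near 0 corresponds to the weak coupling described in the text.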
Table III shows power supply noise simulation results for three 3-D placement schemes: non-decap-aware, decap-aware, and decap-aware with decap placement. The P/G plane structure size is 246 mm × 254 mm, and the top P/G plane pair was modeled using the cavity resonator model [41] and simulated in HSPICE. The placement layer that uses this P/G plane includes 14 active devices. The dc 5 V sources are located at the four edges of the plane pair, and fourteen current sources exist in the plane pair. As can be seen from Table III, the SSN for the decap-aware algorithm is lower than that of the non-decap-aware algorithm. The SSN of the noisiest block, blk4, is 1.58 V, which is reduced to 1.42 V by our decap-aware scheme. With the insertion of decap, the noise is suppressed to 0.22 V. In addition, the total amount of decap required for the non-decap-aware algorithm is 26.7, which is reduced to 19.9 with our decap-aware scheme. The largest amount of decap is used for blk5 (0.50 nF), because of an increase in its SSN after optimization. The numbers show that the SSN was efficiently suppressed and the amount of decap reduced by using our algorithms.

Table IV shows the characteristics of the GSRC and GT benchmark designs used during routing (see Table V). In Table VI, we compare pin redistribution results. Under the DPD columns, we perform DPD (detailed pin distribution, presented in Section V-C) only, where we skip CPD (coarse pin distribution, presented in Section V-B) and assign all x-nets to the routing interval below for net distribution. Under the CPD+DPD columns, we perform CPD, assign all x-nets to the routing interval below for net distribution, and then perform DPD. DPD serves as our baseline, while CPD+DPD demonstrates the impact of our coarse pin distribution. In all cases, we perform detailed pin distribution to legalize the pin locations, i.e., remove overlaps among the pins.
We use the following metrics to evaluate our solutions: wirelength between the original and the new pin locations (dw), total wirelength (wl), crosstalk (xt), and total number of layer pairs used (lyr) in the routing layers. The displacement (dw) and wirelength (wl) results are scaled by 10 , and the time reported is the average runtime among the GSRC/GT circuits. From the comparison between DPD and CPD+DPD, we note that the displacement (dw) increases by an average of 50%. However, CPD consistently lowers the total wirelength (wl) by 10% on average and the number of layers (lyr) by 10% on average.
B. SOP Routing Results
In Table VII, we show our topology generation (RSA/G) and layer assignment (LAYER) results. We used the technology parameters of a 0.13-µm process for Elmore delay computation. Specifically, a driver resistance of 29.4 kΩ, an input capacitance of 0.050 fF, a unit-length resistance of 0.82 Ω/µm, and a unit-length capacitance of 0.24 fF/µm are used. We report the total wirelength (wl), the Elmore delay of the net with the maximum sink delay (dly), and the lower bound and the actual number of layers used for the top-most routing interval. In general, the GSRC benchmarks have larger delay than the GT benchmarks due to their larger average wirelength. Our layer assignment algorithm presented in Section V-D achieves results very close to the lower bounds discussed in the same section. For the GT circuits, the layer assignment results are within 10% of the lower bound. For the GSRC circuits, the results are equal to the lower bound.

Our channel assignment results are shown in Table VIII. The baseline case is where we optimize the wirelength only. We then compare it to our multi-objective channel assignment algorithm that simultaneously optimizes the wirelength, via, and layer usage. Our comparison indicates that the number of layers is consistently and significantly reduced, especially for the bigger GT benchmarks, where an average improvement of 48% is observed. In the case of the second largest benchmark, gt1000, we achieved a 57% improvement. This saving in layer usage comes at the cost of an increase in wirelength and vias. The average increase in wirelength is 12% and 30% for the GSRC and GT benchmarks, respectively. The average increase in via usage is 63% and 79% for the GSRC and GT benchmarks, respectively. We noted that the channel assignment result is very sensitive to the weighting constants among the objectives used in our cost function. This indicates that the solution space of the channel assignment problem offers many useful tradeoff points.
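For reference, the Elmore delay of a routing tree under the parameters quoted above can be computed as in the following sketch. It uses the standard Elmore model with distributed wire capacitance; the tree representation (`parent` and `length` dictionaries), the function name `elmore_delays`, and the reading of the unit-length resistance as Ω/µm are assumptions made for this example.

```python
R_DRV = 29.4e3      # driver resistance, ohm
C_IN = 0.050e-15    # sink input capacitance, F
R_UNIT = 0.82       # wire resistance per um, ohm (unit assumed)
C_UNIT = 0.24e-15   # wire capacitance per um, F


def elmore_delays(parent, length, sinks):
    """Elmore delay from the driver to every sink of a routing tree.

    parent: {node: parent node}, with None for the root (driver).
    length: {node: wire length in um of the edge from its parent}.
    sinks:  set of sink nodes. Returns {sink: delay in seconds}.
    """
    children = {v: [] for v in parent}
    root = None
    for v, p in parent.items():
        if p is None:
            root = v
        else:
            children[p].append(v)

    cap = {}  # capacitance downstream of each node (excluding its own edge)

    def downstream(v):
        total = C_IN if v in sinks else 0.0
        for ch in children[v]:
            total += downstream(ch) + C_UNIT * length[ch]
        cap[v] = total
        return total

    # The driver resistance sees the capacitance of the whole tree.
    delay = {root: R_DRV * downstream(root)}

    def walk(v):
        for ch in children[v]:
            r_edge = R_UNIT * length[ch]
            c_edge = C_UNIT * length[ch]
            # Edge resistance sees half of its own capacitance plus
            # everything downstream of the edge.
            delay[ch] = delay[v] + r_edge * (c_edge / 2 + cap[ch])
            walk(ch)

    walk(root)
    return {s: delay[s] for s in sinks}
```

For a single two-pin wire of length L, this reduces to R_DRV (C_UNIT L + C_IN) + R_UNIT L (C_UNIT L / 2 + C_IN), so with these parameters the driver term dominates unless the wire is very long.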
Table IX reports our local routing results. We report the wirelength, the maximum and average routing demand, and the standard deviation. Our baseline is local routing optimized for wirelength only. We then compare it against our multi-objective local routing algorithm that simultaneously optimizes wirelength and routing demand. In both cases, the same pin demand constraint is imposed. We note that the improvement of our multi-objective algorithm over the baseline is significant, especially for the GSRC circuits: the routing demands were reduced by 33% on average while the wirelength increased by only 10%. In addition, we reduced the routing demands for the GT benchmarks by 41% on average, with a wirelength increase of 20%. For our largest benchmark (gt1500), the routing demand reduction is the largest (53%), which comes with the maximum increase in wirelength (23%). This again indicates that the local routing result is very sensitive to the weighting constants among the objectives used in our cost function. The lower standard deviation of our multi-objective algorithm indicates that the routing demand is more evenly distributed (lower congestion) compared to the wirelength-only case. Last, Fig. 20 shows a snapshot of a four-layer SOP with three routing intervals for the n200 benchmark.
VII. CONCLUSION
In this article, we presented the first physical layout algorithms for 3-D SOP designs, comprising thermal/decap-aware 3-D placement and crosstalk-aware 3-D routing. We extended the thermal and SSN models to 3-D and used them to guide our 3-D module placement. In addition to estimating the amount of decap needed to keep the SSN at the circuit's tolerance level, we also efficiently used nearby white space to allocate decap where necessary. Our routing process is divided into pin redistribution, topology generation, layer assignment, channel assignment, and local routing steps. Our major objective is to reduce the amount of crosstalk and the number of layers while satisfying various constraints on routing resources. Our ongoing work includes SOP detailed routing.
