The complexity of nanometer SoC design requires the codesign and development of circuit design and packaging technology to enable a successful 'total integrated solution'. In this paper we introduce a new area I/O algorithm for flip-chip packaging technology. The algorithm combines a clustering technique with an area I/O planning algorithm to avoid iterations between placement and area I/O pad assignment. Experimental results show that the total interconnect length (including both on-chip and off-chip parts) and delay are reduced by 10-15% compared with traditional algorithms.
INTRODUCTION
As the semiconductor industry drives into nanometer silicon technologies, the race to keep up with Moore's Law has hit some stumbling blocks. Since the introduction of SoC design methodologies in the mid-1990s, silicon and packaging technologies have been pushed hard to support higher performance, lower power, finer geometries and denser I/O solutions. With SoC design methodology, complete systems have been integrated into single-chip solutions. This high level of integration has created new demands for packaging technology.
IC packaging technologies with peripheral I/O pads have several shortcomings. The complexity of the system and the calculated Rent parameters suggest that ICs require asymptotically more pads than the die perimeter can provide [9]. Peripheral I/O pads also constrain clock/power distribution, and their inherently large parasitics cause coupling and power issues for off-chip signaling. Moreover, I/O counts have increased from the low hundreds in the early 1990s to a few thousand today. For some high-end microprocessors, the electrical performance and I/O densities cannot be easily realized with wire-bonding-based solutions. Given these concerns, the area I/O regime (flip chip) is predicted to eventually dominate IC implementation methodology. It offers improved pad count and reliability, reduced noise coupling, and cost savings as the technology matures.
Flip chip packaging technology utilizes very small solder spheres, known as solder bumps. These bumps are part of the silicon chip. The silicon chip with bumps is mounted on the package substrate, similar in operation to a surface mount board assembly process. Flip chip technology provides 5-10 times more I/Os than the traditional method of restricting the I/Os to the periphery [1] [7] . Moreover, flip chip interconnect offers lower I/O inductance, better power / ground distribution, and flexibility to connect directly from the package to anywhere on the die. To develop a successful SoC design with flip chip packaging technology, the package must be part of the design cycle from the very beginning. Unfortunately, existing CAD algorithms are for peripheral packaging technologies [2] . In addition, most of them separate the package design from the design cycle and treat the packaging technology as a 'plug and play' component for the silicon chip design.
Cascade Design Automation recently reported a version of its CAD tool for designing area-array ICs [3]. This tool consists of an area-pad power analyzer, an area-pad floor planner and an area-pad router. The paper by Kiamilev et al. [4] demonstrated three methods of designing an intrinsic area-array IC. The problem with their approach is that the placement and routing of the area pads must be done manually and the resulting packing density is low. The paper by Tan et al. [1] discusses an area-array pad router that automates the placement and routing of the area-array pads on the IC. The problem with this approach is that it is a post-processing tool that can be used only after the initial IC layout has been generated. It does not take into consideration the packaging and off-chip pad placement constraints or the "illegal regions" in the IC. The paper by Caldwell et al. [8] presents an empirical study of the impact of area-array I/O on placement. The results show that the use of area-array I/O leads to shorter wire lengths and better placement when compared to peripheral I/O placement.
In this paper, we propose a clustering based area I/O pad planning algorithm for flip chip packaging technology. Based on our algorithm we have developed a prototype tool called APT (Area I/O Planning Tool). Our algorithm has two phases: clustering and planning. In the clustering phase, we set the initial cluster area as the pad pitch area (area enclosed by four I/O pads on the die) so that after the clustering procedure, the resulting clusters will have predefined area with bumps surrounding them. In the planning phase, we plan the core logic and the I/O clusters simultaneously so that interconnects both off chip and on chip are minimized, considering predefined area I/O pads in the illegal regions (pre-assigned to power and ground lines).
The algorithm is non-iterative and integrates physical design issues with packaging constraints. It provides a "total integrated solution" and can be used for SoC as well as SiP design methodologies. Experimental results show that the total interconnect length (including both on-chip and off-chip parts) is reduced by 10-15% compared with the traditional iterative algorithms, where the I/O assignment is based on a predefined placement. It also achieves a speedup of 10-15X over the traditional iterative algorithms.
The rest of the paper is organized as follows. Section 2 presents the problem formulation and preliminaries. Section 3 gives a detailed description of the APT algorithm. Section 4 presents the experimental results on large test circuits, and Section 5 concludes the paper with insight into future work.
Figure 1 shows the basic structure of a flip chip (FC) package. Flip chip packaging technology utilizes very small solder spheres that are 70 µm to 100 µm high and 80 µm to 125 µm wide. The positions of the solder bumps are predefined and arranged in the form of a matrix. The silicon chip with bumps is mounted on the package substrate, where a predefined array of substrate pads "touch" these bumps to establish connection.
The inputs to our algorithm are a VHDL/Verilog netlist (comprising primary inputs, outputs, IP blocks, and gates), a list of IP blocks, and a technology file (consisting of the pad pitch size and the matrix of solder bumps with illegal I/O pad regions). The core objective is to place and route these components (taking into account the 'illegal regions') along with the off-chip and on-chip I/O pads so that the total wiring length (including both on-chip and off-chip interconnects) and delay are optimized.
As the feature size continues to shrink with the advent of SoC and other design methodologies, the circuit size becomes larger and increasingly difficult to handle. By pre-processing the netlist and creating a clustered netlist, the problem size becomes more manageable. The other reason for clustering is that, in a design with widely varying cell sizes, the clustering step creates clusters of roughly equivalent size, thereby enabling the use of cell-oriented algorithms on the clustered netlist. In the clustering phase, we set the initial cluster area as the pad pitch area (the area between four I/O pads on the die) so that after the clustering procedure, the resulting clusters have a predefined area (in terms of pad pitch) with I/O pad connections (solder bumps) adjoining them.
In the planning phase, we use a heuristic algorithm to plan the core logic and the I/O clusters simultaneously so that interconnects, both off-chip and on-chip, are minimized, considering the predefined illegal regions of area I/O pads (pre-assigned to power and ground lines). Both the clustering and planning stages are described in detail as part of the APT algorithm in Section 3. The first part of the APT algorithm deals with the clustering of the input netlist.
Figure 1. Area array pad structure
APT (AREA I/O PLANNING TOOL)
Clustering Phase
Given a netlist comprising primary inputs, outputs, IP blocks, and gates, the clustering problem is to decompose the components in the netlist into a number of clusters. This preprocessing step is important because it not only reduces the size and complexity of the circuit but also maintains the natural hierarchy of the circuit that is clustered.
The clustering procedure recursively "collapses" small cliques to form clusters that satisfy the area and size requirements. Our clustering procedure works bottom-up and is similar to the one in [5]. The difference is that we use the pad pitch (P: the distance of separation between the pads) to define the area of clusters, so that the resultant clusters have a predefined area (in terms of pad pitch) with solder bumps on their periphery.
First, we convert the given VHDL/Verilog netlist to a weighted directed graph for n-terminal nets [6]. The weights correspond to the adjacency between the nodes. We introduce node replication if a node is communal, i.e., if it is linked to more than one component in the netlist.
The algorithm uses a heuristic that selects a particular node and forms a clique with its neighboring nodes. Let 
The nodes in the clique are collapsed to form a cluster. The weight of the clustered node, W(C), is the sum of the weights of the individual nodes in the clique. The edges that are internal to the clique are removed. For any node v outside the cluster, all edges that connect v to nodes inside the cluster are bundled together to form a new edge that connects v to the newly formed cluster node. The weight of the resultant edge is the sum of the weights of the edges that are bundled together. The MAXCl_i array is cleared and the process is repeated. The clustering procedure ends when there are insufficient nodes to form clusters.
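The collapse step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the graph is simplified to an undirected one, and the function name `collapse_clique`, the dictionary layout, and the example netlist are all assumptions made for the sketch. Clique selection itself is not shown.

```python
from collections import defaultdict


def collapse_clique(node_weight, edge_weight, clique, cluster_name):
    """Collapse the nodes of `clique` into a single cluster node.

    node_weight: dict mapping node -> weight (area).
    edge_weight: dict mapping frozenset({u, v}) -> edge weight
                 (undirected simplification, an assumption of this sketch).
    Returns the new (node_weight, edge_weight) pair.
    """
    clique = set(clique)
    # Nodes outside the clique keep their weight.
    new_nodes = {n: w for n, w in node_weight.items() if n not in clique}
    # W(C) is the sum of the weights of the collapsed nodes.
    new_nodes[cluster_name] = sum(node_weight[n] for n in clique)

    new_edges = defaultdict(int)
    for pair, w in edge_weight.items():
        u, v = tuple(pair)
        if u in clique and v in clique:
            continue  # edges internal to the clique are removed
        u = cluster_name if u in clique else u
        v = cluster_name if v in clique else v
        # Edges from one outside node into the clique are bundled by summing.
        new_edges[frozenset((u, v))] += w
    return new_nodes, dict(new_edges)


# Hypothetical four-node example: collapse {a, b} into cluster "K".
nodes = {"a": 1, "b": 2, "c": 3, "d": 4}
edges = {
    frozenset(("a", "b")): 1,
    frozenset(("a", "d")): 5,
    frozenset(("b", "d")): 7,
    frozenset(("a", "c")): 2,
    frozenset(("c", "d")): 4,
}
new_nodes, new_edges = collapse_clique(nodes, edges, {"a", "b"}, "K")
```

In this example the two edges from d into the clique are bundled into a single edge of weight 12, and the internal edge between a and b disappears.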
Once the clustering process is completed, we end up with a set of clustered nodes and edges with node weights corresponding to the area of clustered nodes and edge weights to the adjacency between the clusters. The edge weights form the elements of the adjacency matrix (A). If two clustered nodes are not adjacent the corresponding element in the adjacency matrix is taken as zero. The clustered nodes along with adjacency matrix (A) and the list of IP blocks are given as the input to the planning part of the APT algorithm.
Planning Phase
The planning phase deals with the optimal assignment of the clusters in the chip area. The objective of the planning is to assign the obtained clusters such that Primary Inputs and Primary Outputs are assigned to legal pad sites and the overall wiring length is minimized.
Terminology
In this section we first introduce some notation. The generated clusters are represented as a new graph G(C,W), where C represents the clusters and W represents the weight/area of the clusters. The chip's dimensions are determined by the total weight of the clusters in graph G. The sum of the weights is approximated to the smallest possible square, which gives the area of the chip. IP(C) is defined as the subset of G consisting of IP-block clusters. The cluster that has the largest area and does not belong to IP(C) is defined as the primary cluster (CL). All the remaining clusters are defined as non-primary clusters. IPmax is defined as the largest IP-block cluster in IP(C). The weight of the primary cluster is approximated to WL, an integer multiple of the pad pitch area (P²) chosen such that it gives the best rectangular fit. SL represents the semi-perimeter of the rectangular area (WL). Assume that the number of generated clusters is n. A is the adjacency matrix of G with dimension n × n. A(i, j) is the element of matrix A that gives the measure of adjacency between clusters i and j. Pil and Ple represent the illegal and legal pad sites respectively. PI and PO are the primary inputs and primary outputs respectively.
The coverage function COV(Pil, WL) gives the number of illegal pad sites covered by the primary cluster (CL) and the number of legal pad sites (Ple) available around CL for a specified location on the chip. X denotes the number of primary inputs and primary outputs adjacent to CL. R(C) is defined as the set of assigned clusters that are taken as the reference for further planning. CR is used to represent a cluster of R(C).
Figure 2. Formation of S(C) and the adjacent primary-input set
S(C) is defined as the set of clusters that are either adjacent to the clusters in R(C) or share a primary input (PI) with clusters in R(C).
The adjacent primary-input set is defined as the subset of primary inputs that are adjacent to the clusters in R(C). Figure 2 depicts how S(C) and this primary-input set are formed. In Figure 2, C1, C2, C3, C5 and C6 form S(C). C2, C5 and C6 are included in S(C) because they are adjacent to the cluster set R(C). C1 and C3 are included in S(C) because they are connected to R(C) through a primary input.
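A small sketch of how S(C) and the adjacent primary inputs could be derived from R(C) is given here. The data layout (`adjacency`, `pi_fanout`), the function name `frontier`, and the example wiring are hypothetical; the example only reproduces the membership described for Figure 2, not the figure's actual nets.

```python
def frontier(adjacency, pi_fanout, reference, assigned=()):
    """Return (S, adjacent_pis) for the reference set R(C).

    adjacency: dict mapping cluster -> set of adjacent clusters.
    pi_fanout: dict mapping primary input -> set of clusters it connects to.
    reference: the clusters currently in R(C).
    assigned:  clusters placed in earlier stages (excluded from S).
    """
    reference = set(reference)
    # Primary inputs adjacent to at least one reference cluster.
    adjacent_pis = {pi for pi, fan in pi_fanout.items() if fan & reference}

    s = set()
    for c in reference:
        s |= adjacency.get(c, set())      # clusters adjacent to R(C)
    for pi in adjacent_pis:
        s |= pi_fanout[pi]                # clusters sharing a PI with R(C)
    return s - reference - set(assigned), adjacent_pis


# Hypothetical wiring consistent with the membership described for Figure 2.
adjacency = {"CR": {"C2", "C5", "C6"}}
pi_fanout = {"PI1": {"CR", "C1"}, "PI2": {"CR", "C3"}, "PI3": {"C7"}}
s_c, pis = frontier(adjacency, pi_fanout, {"CR"})
```

Here C7 stays outside S(C) because it is neither adjacent to CR nor reached through a primary input that CR shares.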
As(C) gives the measure of the sparsity of the adjacency matrix formed by the clusters in S(C). The sparsity of a matrix is defined as the ratio of the number of zero entries to the total number of entries in the matrix. If an element of the adjacency matrix is zero, the clusters corresponding to its row and column are not adjacent to one another. A connectivity matrix (CM) is formed using the clusters of S(C). In Figure 2, C1 and C2, which belong to S(C), are connected through an unassigned external cluster C7. The element in the CM corresponding to these clusters is weighted by the total edge weight between these clusters. C2 and C3 are not connected to one another through any unassigned external cluster. Hence the matrix element corresponding to these clusters is zero. Cs(C) gives the sparsity measure of the connectivity matrix (CM).
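The two sparsity measures can be illustrated with a short sketch. Two points are assumptions of this sketch rather than statements from the paper: diagonal entries are counted when computing sparsity, and a CM entry is taken as the sum of the two edge weights on each path through an unassigned external cluster. The function names and the example matrix are hypothetical.

```python
def sparsity(matrix):
    """Ratio of zero entries to the total number of entries."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for x in row if x == 0)
    return zeros / total


def connectivity_matrix(a, s_idx, external_idx):
    """Build CM for the clusters listed in s_idx.

    a:            full adjacency matrix (list of lists).
    s_idx:        indices of the clusters in S(C).
    external_idx: indices of unassigned clusters outside S(C).
    """
    n = len(s_idx)
    cm = [[0] * n for _ in range(n)]
    for p in range(n):
        for q in range(p + 1, n):
            i, j = s_idx[p], s_idx[q]
            # Sum the edge weights of every i - k - j path via an external k.
            w = sum(a[i][k] + a[k][j]
                    for k in external_idx if a[i][k] and a[k][j])
            cm[p][q] = cm[q][p] = w
    return cm


# Hypothetical example: indices 0..2 are C1, C2, C3; index 3 is external C7.
A = [
    [0, 0, 0, 2],
    [0, 0, 1, 3],
    [0, 1, 0, 0],
    [2, 3, 0, 0],
]
CM = connectivity_matrix(A, s_idx=[0, 1, 2], external_idx=[3])
cs = sparsity(CM)
```

In the example only C1 and C2 are linked through C7, so CM has a single non-zero pair and its sparsity is 7/9.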
LP(S(C)) gives the linear placement of the clusters. The Primary Input Net Span (PNS) gives the total interconnect length required to wire the clusters in S(C) with their adjacent primary inputs.
Planning is done in stages and clusters planned in each stage replace the clusters in R(C). The IP block clusters are decomposed into the minimum possible number of rectangular blocks as shown in Figure 3 . When an IP block is assigned, care is taken such that these blocks are regrouped and placed together. This method preserves the geometrical shape of the IP block during the course of planning.
Figure 3. Decomposition of an L-block

Problem Formulation
The planning phase begins with the assignment of the primary cluster (CL). CL is assigned to a suitable location determined by the number of I/Os adjacent to it and the relative size of IPmax.
R(C) is updated with CL. S(C) and the adjacent primary-input set are determined for the clusters in R(C). The clusters in S(C) are linearly ordered and assigned. The adjacent primary inputs and the primary outputs adjacent to S(C) and R(C) are assigned to the nearest available pad sites. R(C) is then cleared and replaced by the clusters in S(C).
The planning process is repeated until all the clusters in G(C, W) are assigned. We summarize our planning phase in the following steps.
Step 1: Determine the largest non-IP block cluster (CL) from the graph and fit it into the best possible rectangle of area WL with semi-perimeter SL. The area of the primary cluster is given by W(CL). W(CL) is approximated to WL, the closest possible rectangular fit. Equations (4) to (8) describe the approximation.
The length (y) of WL is given by the following relation.
where y is a positive integer. The breadth (x) of WL is given by the following relation.
where x is a positive integer.
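One possible reading of the rectangular fit in Step 1 is sketched here. It does not reproduce equations (4) to (8); the rounding rule (round the area up to whole pad-pitch cells, then pick the integer sides with the smallest semi-perimeter, breaking ties by least wasted area) and the function name `fit_rectangle` are assumptions of this sketch.

```python
import math


def fit_rectangle(w_cl, pitch):
    """Fit an area w_cl into an x-by-y rectangle of pad-pitch cells.

    Returns (x, y, WL, SL) with x <= y, where x and y are positive
    integers counted in pad pitches, WL = x * y * pitch**2 and
    SL = (x + y) * pitch.
    """
    cells = max(1, math.ceil(w_cl / pitch ** 2))   # whole P^2 cells needed
    root = math.isqrt(cells)
    if root * root < cells:
        root += 1                                  # ceiling of sqrt(cells)

    best = None
    for x in range(1, root + 1):
        y = -(-cells // x)                         # ceiling division
        key = (x + y, x * y - cells)               # semi-perimeter, then waste
        if best is None or key < best[0]:
            best = (key, x, y)
    _, x, y = best
    return x, y, x * y * pitch ** 2, (x + y) * pitch


# Example: a 65,000 um^2 cluster with a 100 um pad pitch needs 7 cells,
# which this rule fits into a 2 x 4 rectangle.
x, y, w_l, s_l = fit_rectangle(65_000, 100)
```

Minimizing the semi-perimeter first keeps the rectangle close to square, which is only one plausible interpretation of "best rectangular fit".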
Step 2: For the given graph (G), we generate several possible locations for the primary cluster (CL).
The number of primary inputs and outputs adjacent to the primary cluster (X) is compared with SL. SL gives a measure of the number of pad sites available on the cluster periphery. It is observed that a smaller value of X is associated with a smaller number of non-primary clusters adjacent to CL. Confining CL to a corner in this case leaves the maximum number of legal pad sites available around the assigned clusters as the planning progresses. Thus CL is confined to one of the corners if X is less than SL; otherwise CL is located at a suitable position in the center of the chip. If WL is comparable to IPmax, the following procedure is adopted. A suitable corner location is chosen for IPmax using the coverage function COV(Pil, W(IPmax)) such that the maximum number of illegal pad sites (Pil) is covered. The primary cluster is confined to the corner diagonally opposite to the location of IPmax. This is done to provide the maximum number of pad locations on the periphery of the primary cluster.
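The decision rule of Step 2 can be summarized in a few lines. The paper does not quantify when WL is "comparable" to IPmax, so the `comparable` ratio below is a hypothetical parameter, as are the function name and the returned labels; SL is assumed to be expressed in pad pitches.

```python
def primary_cluster_region(x_io, s_l, w_l, w_ipmax, comparable=0.8):
    """Choose the region in which the primary cluster CL is placed.

    x_io:       X, the number of primary I/Os adjacent to CL.
    s_l:        SL, the semi-perimeter of CL in pad pitches.
    w_l:        WL, the fitted area of CL.
    w_ipmax:    area of the largest IP-block cluster (0 if there is none).
    comparable: hypothetical area ratio above which WL and IPmax are
                treated as comparable (not specified in the paper).
    """
    if w_ipmax > 0 and min(w_l, w_ipmax) / max(w_l, w_ipmax) >= comparable:
        # IPmax takes the corner with the best coverage; CL goes opposite.
        return "corner opposite IPmax"
    if x_io < s_l:
        return "corner"
    return "center"


few_ios = primary_cluster_region(x_io=3, s_l=6, w_l=8, w_ipmax=2)
many_ios = primary_cluster_region(x_io=9, s_l=6, w_l=8, w_ipmax=2)
large_ip = primary_cluster_region(x_io=3, s_l=6, w_l=8, w_ipmax=7)
```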
Step 3: From all possible locations, we find the best location for CL. The location of CL has a great impact on the final solution since most of the primary I/Os and non-primary clusters are connected to CL. If CL is confined to a corner, we expand the possible corner locations by shifting CL by one block (P²) in all possible directions. We expand the possible center locations in the same way. The coverage function COV(Pil, WL) is used to refine these locations. The location that offers the maximum coverage and the maximum number of legal pad sites (Ple) around CL is chosen as the best location for the primary cluster. Once CL is assigned, it is taken as a reference cluster in R(C).
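A minimal sketch of a coverage function and the location refinement follows. Several modelling choices are assumptions of this sketch: pad sites sit on integer lattice points of the pad-pitch grid, a site counts as covered if it lies inside or on the border of the cluster rectangle, legal sites "around" CL are the non-illegal sites on its border, and candidates are ranked by coverage first and legal sites second. The names `cov` and `best_location` are hypothetical.

```python
def cov(illegal, x0, y0, w, h):
    """Return (illegal sites covered, legal sites on the border).

    The cluster occupies the rectangle with lower-left lattice point
    (x0, y0), width w and height h, all in pad pitches.
    illegal: set of (i, j) lattice points reserved for power and ground.
    """
    covered = 0
    legal_around = 0
    for i in range(x0, x0 + w + 1):
        for j in range(y0, y0 + h + 1):
            on_border = i in (x0, x0 + w) or j in (y0, y0 + h)
            if (i, j) in illegal:
                covered += 1
            elif on_border:
                legal_around += 1
    return covered, legal_around


def best_location(illegal, candidates, w, h):
    """Pick the candidate with maximum coverage, then most legal sites."""
    return max(candidates, key=lambda p: cov(illegal, p[0], p[1], w, h))


# Hypothetical 2 x 2 cluster and three candidate lower-left corners.
illegal_sites = {(0, 0), (1, 1), (2, 2)}
candidates = [(0, 0), (1, 1), (3, 3)]
chosen = best_location(illegal_sites, candidates, w=2, h=2)
```

The lexicographic ranking is only one way to combine the two criteria; a weighted sum would be an equally plausible reading of the text.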
Step 4: For the given non-primary clusters and their adjacency matrix, assign them to suitable locations around the clusters in R(C) such that the total wiring length is minimized. S(C) and the adjacent primary-input set are determined as shown in Figure 2 for the given cluster set R(C). The connectivity matrix is derived from the adjacency matrix using S(C). As(C) and Cs(C) are computed from the adjacency and connectivity matrices respectively. The clusters in S(C) are ordered so as to minimize the interconnect length required to wire the clusters in S(C) and the adjacent primary inputs with those in R(C). The ordering of the clusters is done by the linear placement LP(S(C)), taking the Primary Input Net Span (PNS), As(C) and Cs(C) into consideration. The following paragraph describes the significance of PNS, As(C) and Cs(C) in the linear ordering of the clusters.
A primary input (PI) is usually connected to more than one cluster. Placing clusters that share a primary input away from one another results in a larger interconnect length. If the sparsity measure of the adjacency matrix (As(C)) is less than 0.7, failure to take adjacency into consideration increases the interconnect length significantly. A smaller value of As(C) implies that the number of non-zero entries in the adjacency matrix is large. A large number of non-zero entries indicates that most of the clusters are adjacent to one another. In this case, if the adjacent clusters are placed away from one another, the total interconnect requirement increases significantly. The value of 0.7 was chosen after experimental verification on several test cases. Similarly, if Cs(C) is less than 0.7, the additional wiring length incurred by placing two connected clusters apart from one another cannot be ignored.
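The role of the 0.7 threshold and of shared primary inputs can be illustrated with a short sketch. The paper's PNS formula is not reproduced here; the span below (distance in slots between the first and last cluster sharing a primary input in a linear order) is a stand-in proxy, and both function names and the example nets are hypothetical.

```python
def active_criteria(a_s, c_s, threshold=0.7):
    """List the ordering criteria that matter for given sparsity values."""
    criteria = ["PNS"]
    if a_s < threshold:
        criteria.append("adjacency")      # dense adjacency matrix
    if c_s < threshold:
        criteria.append("connectivity")   # dense connectivity matrix
    return criteria


def net_span(order, pi_fanout):
    """Proxy for PNS: total slot span of the nets driven by primary inputs.

    order:     clusters in their linear placement order.
    pi_fanout: dict mapping primary input -> set of clusters it connects to.
    """
    pos = {c: i for i, c in enumerate(order)}
    total = 0
    for fan in pi_fanout.values():
        slots = [pos[c] for c in fan if c in pos]
        if len(slots) > 1:
            total += max(slots) - min(slots)
    return total


# Hypothetical nets: PI2 feeds C2 and C4, PI3 feeds C1 and C3.
nets = {"PI2": {"C2", "C4"}, "PI3": {"C1", "C3"}}
unordered = net_span(["C1", "C2", "C3", "C4"], nets)
ordered = net_span(["C2", "C4", "C1", "C3"], nets)
```

Placing the clusters that share a primary input next to each other halves the proxy span in this example, which is the effect the ordering aims for.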
An alternative means of ordering the clusters is taking delay between the clusters into consideration. The delay between the clusters can be computed using the following equations. 
The influence of ordering the clusters on the interconnect length is explained below.
Ordering by Primary Input Net Span (PNS)
For the sake of simplification, R(C) is assumed to have only one reference cluster, CR. In Figure 4, C2 and C4 are connected to CR through the primary input PI2. Since C2 and C4 are not assigned adjacent locations, the interconnect length required to wire these clusters is large. Similarly, the nets corresponding to PI3 and PI4 also contribute additional interconnect length. In Figure 5, the clusters are ordered such that clusters sharing a primary input are assigned adjacent locations. The interconnect length requirement is observed to be reduced significantly.
Figure 4. Unordered Clusters
Ordering by Adjacency
Ordering by Connectivity through an external cluster
In Figure 8, C1, C2, C3, C4, C5 and C6 are the clusters of S(C). C7, C8 and C9 do not belong to S(C). Though the clusters of S(C) are not adjacent, they are connected to one another through external clusters (C7, C8 and C9). In Figure 8, C1 and C4 are connected through C8. Assigning C1 and C4 away from one another requires a larger interconnect length to wire them with C8. This problem is overcome by placing C1 and C4 closer together. Figure 9 shows the clusters ordered on the basis of connectivity through an external cluster.
Step 5: For a given ordering of LP(S(C)), assign clusters in S(C), their corresponding primary inputs and primary outputs such that optimal placement is achieved.
LP(S(C)) gives the ordering of the clusters but does not provide any information about their absolute locations. The ordered list of S(C) is partitioned into several subsets such that no clusters in any two subsets are either adjacent or connected through an external cluster. The obtained subsets of clusters are assigned close to the clusters in R(C). The advantage of the partitioning is that the clusters can be assigned in batches. This partitioning is very effective when the periphery of CR is marked by illegal pad sites (Pil). After all the clusters of S(C) have been assigned, the adjacent primary inputs are assigned to the available legal pad sites (Ple) in the proximity of their corresponding S(C) and R(C) clusters. The primary outputs dedicated to the clusters of S(C) and R(C) are determined from the adjacency matrix (A). The obtained primary outputs are assigned to the closest available legal pad sites (Ple). The clusters in R(C) are then replaced by the clusters of S(C).
Figure 8. Unordered Clusters
Figure 9. Ordered Clusters (Connectivity approach)
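The partitioning of the ordered list in Step 5 amounts to grouping clusters that are related, directly or transitively, by adjacency or by connectivity through an external cluster. A minimal sketch under that reading is shown here; the `related` predicate, the function name, and the example links are hypothetical.

```python
def partition_ordered(order, related):
    """Split an ordered cluster list into mutually unrelated subsets.

    order:   clusters in LP(S(C)) order.
    related: function (u, v) -> bool, True when u and v are adjacent or
             connected through an external cluster.
    Each subset keeps the relative order given by `order`.
    """
    groups = []
    seen = set()
    for start in order:
        if start in seen:
            continue
        group, stack = [], [start]
        seen.add(start)
        while stack:                       # depth-first component search
            u = stack.pop()
            group.append(u)
            for v in order:
                if v not in seen and related(u, v):
                    seen.add(v)
                    stack.append(v)
        group.sort(key=order.index)        # restore the linear ordering
        groups.append(group)
    return groups


# Hypothetical links: C1-C4 through an external cluster, C2-C3 adjacent.
links = {frozenset(("C1", "C4")), frozenset(("C2", "C3"))}
batches = partition_ordered(
    ["C1", "C4", "C2", "C3", "C5"],
    lambda u, v: frozenset((u, v)) in links,
)
```

Each resulting batch can then be assigned near R(C) independently of the others.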
Step 6: Plan the remaining unassigned clusters connected to the clusters in R(C). S(C) and the adjacent primary-input set are derived as described earlier in Figure 2 for the modified cluster set R(C). The clusters in S(C) are ordered as in Step 4.
Step 5 is repeated to plan the unassigned clusters. The process is repeated until all the clusters have been planned.
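The stage-by-stage structure of Steps 4 to 6 resembles a breadth-first sweep outward from the primary cluster. The sketch below shows only that control flow: it considers adjacency alone (shared primary inputs are ignored), and a plain sort stands in for the LP(S(C)) ordering. The function name and example graph are hypothetical.

```python
def plan_in_stages(adjacency, primary):
    """Return the clusters planned at each stage, starting from CL.

    adjacency: dict mapping cluster -> set of adjacent clusters.
    primary:   the primary cluster CL, assumed to be placed already.
    """
    reference = {primary}                  # R(C)
    assigned = {primary}
    stages = []
    while True:
        s_c = set()
        for c in reference:
            s_c |= adjacency.get(c, set())
        s_c -= assigned                    # only unassigned clusters
        if not s_c:
            break
        stage = sorted(s_c)                # placeholder for LP(S(C))
        stages.append(stage)
        assigned |= s_c
        reference = s_c                    # R(C) is replaced by S(C)
    return stages


graph = {
    "CL": {"C1", "C2"},
    "C1": {"CL", "C3"},
    "C2": {"CL"},
    "C3": {"C1", "C4"},
    "C4": {"C3"},
}
stages = plan_in_stages(graph, "CL")
```

Clusters that are not reachable from CL would never be planned by this simplified loop, so a complete implementation would need a fallback for disconnected parts of the graph.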
Delay Computation
The critical path delay is calculated from the graph G(C,W). It is the sum of the delay contributed by the clusters in the path along with the on chip and off chip interconnects. As the nodes inside a cluster are strongly connected, the interconnect delay within a cluster is negligible and the nodes contribute to the cluster delay. The wiring delay is computed from equations (9) and (10) taking CH = CV = 2.5pF/cm and Rwire = 1.
EXPERIMENTAL RESULTS
We have implemented the APT algorithm in the C programming language. The experiments were run on several test circuits and the results are tabulated in Table 1. The circuits C499, C880, C2670 and C7552 are from the ISCAS 85/89 benchmark suite, while the other circuits are synthesized from them. The experiments were run on a SUN Ultra2 workstation with 512 MB of memory. The pad pitch (P) is taken as 100 µm for solder-bump-based flip chip technology. Figure 10 gives the resultant clusters and their orientation for test circuit 3. It also plots the legal I/O pads that were used for interconnection with the clustered nodes. The results show a reduction in interconnect length (including both on-chip and off-chip parts) and delay of 10-15% over the existing iterative approach. The resultant delays are computed using equations (9) and (10) for the APT algorithm. The execution time of the APT algorithm is also reported. We also achieve a speedup of 10-15X over the iterative placement approach. The iterative approach used for comparison starts from an initial assignment of the clusters inside the flip chip area and refines it iteratively to reduce interconnect length and delay. Its solution depends on the initial location. Its execution time is the time taken by the iterative algorithm to converge.
CONCLUSION
The APT (Area I/O Planning Tool) algorithm provides a "total integrated solution" for placing the clustered nodes along with the I/O pads in an area-array-based system. It provides a non-iterative planning approach for SoC and SiP design methodologies. Experimental results on large test circuits show that, as a result of using APT, the total interconnect length (including both on-chip and off-chip parts) and circuit delay are reduced by 10-15%. It also achieves a speedup of 10-15X over the iterative method. In the future we plan to refine our algorithm to incorporate area I/O buffer planning.
Table 1. Interconnect length, IL (mm), and delay for the test circuits